Author's Abstract



We present an algorithm, called Disk Paxos, for implementing a reliable 
distributed system with a network of processors and disks. Like the origi- 
nal Paxos algorithm, Disk Paxos maintains consistency in the presence of 
arbitrary non-Byzantine faults. Progress can be guaranteed as long as a 
majority of the disks are available, even if all processors but one have failed. 
1 Introduction



Fault tolerance requires redundant components. Maintaining consistency in 
the event of a system partition makes it impossible for a two-component 
system to make progress if either component fails. There are innumerable 
fault-tolerant algorithms for implementing distributed systems, but all that 
we know of equate component with processor. But there are other types of 
components that one might replicate instead. In particular, modern net- 
works can now include disk drives as independent components. Because 
commodity disks are cheaper than computers, it is attractive to use them as 
the replicated components for achieving fault tolerance. Commodity disks 
differ from processors in that they are not programmable, so we can't just 
substitute disks for processors in existing algorithms. 

We present here an algorithm called Disk Paxos for implementing an 
arbitrary fault-tolerant system with a network of processors and disks. It 
maintains consistency in the event of any number of non-Byzantine failures. 
That is, the algorithm tolerates faulty processors that pause for arbitrarily 
long periods, fail completely, and possibly restart; and it tolerates lost and 
delayed messages. Disk Paxos guarantees progress if the system is stable and 
there is at least one nonfaulty processor that can read and write a majority 
of the disks. Stability means that each processor is either nonfaulty or has 
failed completely, and nonfaulty processors can access nonfaulty disks. 

Disk Paxos is a variant of the classic Paxos algorithm [3, 10, 12], a simple, 
efficient algorithm that has been used in practical distributed systems [13, 
16]. Classic Paxos can be viewed as an implementation of Disk Paxos in 
which there is one disk per processor, and a disk can be accessed directly 
only by its processor. 

In the next section, we recall how to reduce the problem of implementing 
an arbitrary distributed system to the consensus problem. Section 3 infor- 
mally describes Disk Synod, the consensus algorithm used by Disk Paxos. 
It includes a sketch of an incomplete correctness proof and explains the rela- 
tion between Disk Synod and the Synod protocol of classic Paxos. Section 4 
briefly discusses some implementation details and contains the conventional 
concluding remarks. An appendix gives formal specifications of the consen- 
sus problem and the Disk Synod algorithm, and sketches a rigorous correct- 
ness proof. 
2 The State-Machine Approach



The state-machine approach [5, 14] is a general method for implementing 
an arbitrary distributed system. The system is designed as a deterministic 
state machine that executes a sequence of commands, and a consensus
algorithm ensures that, for each n, all processors agree on the nth command.
This reduces the problem of building an arbitrary system to solving the con- 
sensus problem. In the consensus problem, each processor p starts with an 
input value input[p], and all processors output the same value, which equals 
input[p] for some p. A solution should be:

Consistent All values output are the same. 

Nonblocking If the system is stable and a nonfaulty processor can com- 
municate with a majority of disks, then the processor will eventually 
output a value. 

It has long been known that a consistent, nonblocking consensus algorithm 
requires a three-phase commit protocol [15], with voting, prepare to commit, 
and commit phases. Nonblocking algorithms that use fewer phases don't 
guarantee consistency. For example, the group communication algorithms 
of Isis [2] permit two processors belonging to the current group to disagree 
on whether a message was broadcast in a previous group to which they both 
belonged. This algorithm cannot, by itself, guarantee consistency because 
disagreement about whether a message had been broadcast can result in 
disagreement about the output value. 

The classic Paxos algorithm [3, 10, 12] achieves its efficiency by using 
a three-phase commit protocol, called the Synod algorithm, in which the 
value to be committed is not chosen until the second phase. When a new 
leader is elected, it executes the first phase just once for the entire sequence 
of consensus algorithms performed for all later system commands. Only the 
last two phases are performed separately for each individual command. 

In the Disk Synod algorithm, the consensus algorithm used by Disk 
Paxos, each processor has an assigned block on each disk. The algorithm 
has two phases. In each phase, a processor writes to its own block and reads 
each other processor's block on a majority of the disks.¹ Only the last phase
needs to be executed anew for each command. So, in the normal steady-state
case, a leader chooses a state-machine command by executing a single
write to each of its blocks and a single read of every other processor's blocks.

¹ There is also an extra phase that a processor executes when recovering from a failure.
The classic result of Fischer, Lynch, and Paterson [4] implies that a
purely asynchronous nonblocking consensus algorithm is impossible. So, 
real-time clocks must be introduced. The typical industry approach is to 
use an ad hoc algorithm based on timeouts to elect a leader, and then have 
the leader choose the output. It is easy to devise a leader-election algorithm 
that works when the system is stable, which means that it works most of 
the time. It is very hard to make one that always works correctly even when 
the system is unstable. Both classic Paxos and Disk Paxos also assume a 
real-time algorithm for electing a leader. However, the leader is used only to 
ensure progress. Consistency is maintained even if there are multiple leaders. 
Thus, if the leader-election algorithm fails because the network is unstable, 
the system can fail to make progress; it cannot become inconsistent. The 
system will again make progress when it becomes stable and a single leader 
is elected. 

3 An Informal Description of Disk Synod 

We now informally describe the Disk Synod algorithm and explain why 
it works. We also discuss its relation to classic Paxos's Synod Protocol. 
Remember that, in normal operation, only a single leader will be executing 
the algorithm. The other processors do nothing; they simply wait for the 
leader to inform them of the outcome. However, the algorithm must preserve 
consistency even when it is executed by multiple processors, or when the 
leader fails before announcing the outcome and a new leader is chosen. 

3.1 The Algorithm 

We assume that each processor p starts with an input value input[p].² As
in Paxos's Synod algorithm, a processor executes a sequence of numbered 
ballots, with increasing ballot numbers. A ballot number is a positive inte- 
ger, and different processors use different ballot numbers. For example, if 
the processors are numbered from 1 through N, then processor i could use 
ballot numbers i, i + N, i + 2N, etc. A ballot has two phases: 

Phase 1 Choose a value v. 
Phase 2 Try to commit v. 

In either phase, a processor aborts its ballot if it learns that another
processor has begun a higher-numbered ballot. In that case, the processor may
then choose a higher ballot number and start a new ballot. If the processor
completes phase 2 without aborting (that is, without learning of a
higher-numbered ballot), then value v is committed and the processor can output it.
Since a processor does not choose the value to be committed until phase 2,
phase 1 can be performed once for any number of separate instances of the
algorithm.

² If processor p fails, it can restart with a new value of input[p].
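
The ballot-numbering scheme suggested above (processor i uses i, i + N, i + 2N, and so on) can be sketched in a few lines of Python. This is only an illustration: the names ballots and next_ballot are hypothetical and do not appear in the algorithm, and next_ballot shows just one possible way for a processor to pick a higher ballot number after it aborts.

```python
from itertools import islice
from typing import Iterator


def ballots(i: int, n: int) -> Iterator[int]:
    """Yield the ballot numbers i, i + n, i + 2n, ... of processor i (1 <= i <= n)."""
    if not 1 <= i <= n:
        raise ValueError("processor index must be in 1..n")
    b = i
    while True:
        yield b
        b += n


def next_ballot(i: int, n: int, seen: int) -> int:
    """Return processor i's smallest ballot number strictly greater than `seen`."""
    if seen < i:
        return i
    return i + ((seen - i) // n + 1) * n


# First four ballot numbers of processor 2 in a system of 3 processors.
print(list(islice(ballots(2, 3), 4)))
# A ballot number for processor 2 that exceeds an observed mbal of 7.
print(next_ballot(2, 3, seen=7))
```

Because each processor's ballot numbers fall in a distinct residue class modulo N, no two processors can ever use the same ballot number.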

To ensure consistency, we must guarantee that two different values can- 
not be successfully committed — either by different processors or by the same 
processor in two different ballots. To ensure that the algorithm is nonblock- 
ing, we must guarantee that, if there is only a single processor p executing 
it, then p will eventually commit a value. 

In practice, when a processor successfully commits a value, it will write 
on its disk block that the value was committed and also broadcast that 
fact to the other processors. If a processor learns that a value has been 
committed, it will abort its ballot and simply output the value. It is obvious 
that this optimization preserves correctness; we will not consider it further. 

To execute the algorithm, a processor p maintains a record dblock[p] 
containing the following three components: 

mbal The current ballot number. 

bal The largest ballot number for which p reached phase 2. 

inp The value p tried to commit in ballot number bal. 

Initially, bal equals 0, inp equals a special value NotAnInput that is not a
possible input value, and mbal is any ballot number. We let disk[d][p] be
the block on disk d in which processor p writes dblock[p]. We assume that
reading and writing a block are atomic operations.

Processor p executes phase 1 or 2 of a ballot as follows. For each disk
d, it tries first to write dblock[p] to disk[d][p] and then to read disk[d][q]
for all other processors q. It aborts the ballot if, for any d and q, it finds
disk[d][q].mbal > dblock[p].mbal. The phase completes when p has written
and read a majority of the disks, without reading any block whose mbal
component is greater than dblock[p].mbal. When it completes phase 1, p
chooses a new value of dblock[p].inp, sets dblock[p].bal to dblock[p].mbal (its
current ballot number), and begins phase 2. When it completes phase 2, p
has committed dblock[p].inp.
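
The write-then-read step of a phase can be sketched as follows. This is a single-threaded Python illustration under stated assumptions, not the algorithm's specification: the disks are modeled as an in-memory dictionary, the set of disks the processor manages to access is passed in as `reachable`, None stands in for NotAnInput, and the names DiskBlock, BallotAborted, and run_phase are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Iterable, List, Optional


@dataclass(frozen=True)
class DiskBlock:
    mbal: int = 0              # current ballot number (0 is a placeholder here)
    bal: int = 0               # largest ballot for which phase 2 was reached
    inp: Optional[str] = None  # None stands in for NotAnInput


class BallotAborted(Exception):
    """Raised when a block with a larger mbal than our own is read."""


def run_phase(
    p: int,
    dblock: DiskBlock,
    disks: Dict[Hashable, Dict[int, DiskBlock]],
    reachable: Iterable[Hashable],
) -> Optional[List[DiskBlock]]:
    """Run one phase for processor p.

    Returns the blocks seen if a majority of disks was written and read,
    None if no majority was reached, and raises BallotAborted if some
    block with a larger mbal was read.
    """
    blocks_seen = [dblock]
    accessed = 0
    for d in reachable:
        disks[d][p] = dblock                # first write our own block ...
        for q, bk in disks[d].items():      # ... then read every other block
            if q == p:
                continue
            if bk.mbal > dblock.mbal:
                raise BallotAborted(f"disk {d}: processor {q} has mbal {bk.mbal}")
            blocks_seen.append(bk)
        accessed += 1
    if 2 * accessed <= len(disks):
        return None                         # not a majority: phase incomplete
    return blocks_seen
```

Note that a write performed before an abort stays on disk; the sketch makes no attempt to undo it.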

To complete our description of the two phases, we now describe how
processor p chooses the value of dblock[p].inp that it tries to commit in phase 2.
Let blocksSeen be the set consisting of dblock[p] and all the records disk[d][q]
read by p in phase 1. Let nonInitBlks be the subset of blocksSeen consisting
of those records whose inp field is not NotAnInput. If nonInitBlks is empty,
then p sets dblock[p].inp to its own input value input[p]. Otherwise, it sets
dblock[p].inp to bk.inp for some record bk in nonInitBlks having the largest
value of bk.bal.
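
The value-selection rule just described is short enough to write out directly. The Python sketch below is illustrative only; the record type DiskBlock and the function choose_inp are hypothetical names, and None is used as a stand-in for NotAnInput.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DiskBlock:
    mbal: int
    bal: int
    inp: Optional[str]  # None stands in for NotAnInput


def choose_inp(own_input: str, blocks_seen: Iterable[DiskBlock]) -> str:
    """Pick the value to try to commit in phase 2."""
    # nonInitBlks: the blocks seen whose inp field is not NotAnInput.
    non_init_blks = [bk for bk in blocks_seen if bk.inp is not None]
    if not non_init_blks:
        return own_input
    # Otherwise take inp from a block with the largest bal.
    return max(non_init_blks, key=lambda bk: bk.bal).inp
```

When several blocks share the largest bal, max returns the first one; the rule permits any of them.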

Finally, we describe what processor p does when it recovers from a
failure. In this case, p reads its own block disk[d][p] from a majority of disks
d. It then sets dblock[p] to any block bk it read having the maximum value
of bk.mbal, and it starts a new ballot by increasing dblock[p].mbal and
beginning phase 1.
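
The recovery step can be sketched in the same style. As before, this is a hedged Python illustration: the in-memory `disks` dictionary, the `reachable` argument, and the `higher_ballot` callback (which returns some ballot number of p larger than its argument) are all assumptions introduced for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Hashable, Iterable, Optional


@dataclass(frozen=True)
class DiskBlock:
    mbal: int
    bal: int
    inp: Optional[str]  # None stands in for NotAnInput


def recover(
    p: int,
    disks: Dict[Hashable, Dict[int, DiskBlock]],
    reachable: Iterable[Hashable],
    higher_ballot: Callable[[int], int],
) -> Optional[DiskBlock]:
    """Return the dblock with which p starts phase 1 after a failure.

    Returns None if p's block could not be read from a majority of disks.
    """
    own_blocks = [disks[d][p] for d in reachable]
    if 2 * len(own_blocks) <= len(disks):
        return None
    bk = max(own_blocks, key=lambda b: b.mbal)        # block with maximum mbal
    return replace(bk, mbal=higher_ballot(bk.mbal))   # start a new, higher ballot
```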

3.2 Why the Algorithm Works 

Suppose processor p can read and write a majority of the disks, and all 
processors other than p stop executing the algorithm. In this case, p will 
eventually choose a ballot number greater than the mbal field of all blocks 
on the disks it can read, and its ballot will succeed. Hence, this algorithm 
is nonblocking, in the sense explained above. 

We now explain, intuitively, why the Disk Synod algorithm maintains 
consistency. First, we consider the following shared-memory version of the 
algorithm that uses single-writer, multiple-reader regular registers.³ Instead
of writing to disk, processor p writes dblock[p] to a shared register; and it 
reads the values of dblock[q] for other processors q from the registers. A 
processor chooses its bal and inp values for phase 2 the same way as before, 
except that it reads just one dblock value for each other processor, rather 
than one from each disk. We assume for now that processors do not fail. 

To prove consistency, we must show that, for any processors p and q,
if p finishes phase 2 and commits the value v_p and q finishes phase 2 and
commits the value v_q, then v_p = v_q. Let b_p and b_q be the respective ballot
numbers on which these values are committed. Without loss of generality,
we can assume b_p < b_q. Moreover, using induction on b_q, we can assume
that, if any processor r starts phase 2 for a ballot b_r with b_p < b_r < b_q,
then it does so with dblock[r].inp = v_p.

When reading in phase 2, p cannot have seen the value of dblock[q].mbal
written by q in phase 1; otherwise, p would have aborted. Hence p's read
of dblock[q] in phase 2 did not follow q's phase 1 write. Because reading
follows writing in each phase, this implies that q's phase 1 read of dblock[p]
must have followed p's phase 2 write. Hence, q read the current (final)
value of dblock[p] in phase 1: a record with bal field b_p and inp field v_p.
Let bk be any other block that q read in its phase 1. Since q did not
abort, b_q > bk.mbal. Since bk.mbal ≥ bk.bal for any block bk, this implies
b_q > bk.bal. By the induction assumption, we obtain that, if bk.bal > b_p,
then bk.inp = v_p. Since this is true for all blocks bk read by q in phase 1,
and since q read the final value of dblock[p], the algorithm implies that q
must set dblock[q].inp to v_p for phase 2, proving that v_p = v_q.

³ A regular register is one in which a read that does not overlap a write returns the
register's current value, and a read that overlaps one or more writes returns either the
register's previous value or one of the values being written [6].

To obtain the Disk Synod algorithm from the shared-memory version, 
we use a technique due to Attiya, Bar-Noy, and Dolev [1] to implement 
a single-writer, multiple-reader register with a network of disks. To write
a value, a processor writes the value together with a version number to a 
majority of the disks. To read, a processor reads a majority of the disks 
and takes the value with the largest version number. Since two majorities of 
disks contain at least one disk in common, a read must obtain either the last 
version for which the write was completed, or else a later version. Hence, 
this implements a regular register. With this technique, we transform the 
shared-memory version into a version for a network of processors and disks. 
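
The majority-based register construction described above can be illustrated with a small single-threaded Python sketch. It is not the construction of [1] itself, only a model of the write-to-a-majority, read-from-a-majority idea; the function names, the dictionary of disks, and the `reachable` argument are assumptions made for the example.

```python
from typing import Any, Dict, Hashable, Iterable, Optional, Tuple

Versioned = Tuple[int, Any]  # (version number, value)


def write_register(
    disks: Dict[Hashable, Versioned],
    reachable: Iterable[Hashable],
    version: int,
    value: Any,
) -> bool:
    """Write (version, value) to the reachable disks.

    Returns True only if a majority of all disks was written.
    """
    written = 0
    for d in reachable:
        disks[d] = (version, value)
        written += 1
    return 2 * written > len(disks)


def read_register(
    disks: Dict[Hashable, Versioned],
    reachable: Iterable[Hashable],
) -> Optional[Versioned]:
    """Read the reachable disks and return the entry with the largest version.

    Returns None if fewer than a majority of disks could be read.
    """
    seen = [disks[d] for d in reachable]
    if 2 * len(seen) <= len(disks):
        return None
    return max(seen, key=lambda entry: entry[0])
```

With three disks, a write that reaches disks a and b and a later read that reaches disks b and c overlap on b, so the read returns the written version.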

The actual Disk Synod algorithm simplifies the algorithm obtained by 
this transformation in two ways. First, the version number is not needed. 
The mbal and bal values play the role of a version number. Second, a 
processor p need not choose a single version of dblock[q] from among the 
ones it reads from disk. Because mbal and bal values do not decrease, earlier 
versions have no effect. 

So far, we have ignored processor failures. There is a trivial way to 
extend the shared-memory algorithm to allow processor failures. A processor 
recovers by simply reading its dblock value from its register and starting a 
new ballot. A processor that fails and recovers then behaves like one that may
start a new ballot at any time. We can show that this generalized version
is also correct. However, in the actual disk algorithm, a processor can fail 
while it is writing. This can leave its disk blocks in a state in which no value 
has been written to a majority of the disks. Such a state has no counterpart 
in the shared-memory version. There seems to be no easy way to derive 
the recovery procedure from a shared-memory algorithm. The proof of the 
complete Disk Synod algorithm, with failures, is much more complicated 
than the one for the simple shared-memory version. Trying to write the 
kind of behavioral proof given above for the simple algorithm leads to the 
kind of complicated, error-prone reasoning that we have learned to avoid. 
Instead, we sketch a rigorous assertional proof in the appendix. 
3.3 Deriving Classic Paxos from Disk Paxos

In the usual view of a distributed fault-tolerant system, a processor performs 
actions and maintains its state in local memory, using stable storage to 
recover from failures. An alternative view is that a processor maintains the 
state of its stable storage, using local memory only to cache the contents of 
stable storage. Identifying disks with stable storage, a traditional distributed 
system is then a network of disks and processors in which each disk belongs 
to a separate processor; other processors can read a disk only by sending 
messages to its owner. 

Let us now consider how to implement Disk Synod on a network of
processors, each of which has its own disk. To perform phase 1 or 2, a processor
p would access a disk d by sending a message containing dblock[p] to disk
d's owner q. Processor q could write dblock[p] to disk[d][p], read disk[d][r]
for all r ≠ p, and send the values it read back to p. However, examining
the Disk Synod algorithm reveals that there's no need to send back all that
data. All p needs is (i) to know if its mbal field is larger than any other
block's mbal field and, if it is, (ii) the bal and inp fields for the block having
the maximum bal field. Hence, q need only store on disk three values: the
bal and inp fields for the block with maximum bal field, and the maximum
mbal field of all disk blocks. Of course, q would have those values cached in
its memory, so it would actually write to disk only if any of those values
changed.
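
The three-value summary kept by a disk's owner can be sketched as follows. This Python fragment is an illustration under assumptions, not a description of classic Paxos: the class OwnerSummary, its handle method, and the disk_writes counter are hypothetical, None stands in for NotAnInput, and the test `mbal >= max_mbal` relies on the fact that different processors use different ballot numbers (so an equal mbal can only be the sender's own).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class OwnerSummary:
    max_mbal: int = 0          # maximum mbal over all blocks received
    max_bal: int = 0           # maximum bal over all blocks received
    inp: Optional[str] = None  # inp of the block with maximum bal
    disk_writes: int = 0       # how often the cached summary had to be flushed

    def handle(self, mbal: int, bal: int, inp: Optional[str]) -> Tuple[bool, int, Optional[str]]:
        """Fold in a dblock sent by some processor p.

        Returns (ok, bal, inp): ok tells p whether no larger mbal has been
        seen; bal and inp describe the block with the maximum bal.
        """
        ok = mbal >= self.max_mbal
        changed = False
        if mbal > self.max_mbal:
            self.max_mbal = mbal
            changed = True
        if bal > self.max_bal:
            self.max_bal, self.inp = bal, inp
            changed = True
        if changed:
            self.disk_writes += 1  # write to disk only when a value changed
        return ok, self.max_bal, self.inp
```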

A processor must also read its own disk blocks to recover from a failure. 
Suppose we implement Disk Synod by letting p write to its own disk before 
sending messages to any other processor. This ensures that its own disk 
has the maximum value of disk[d][p].mbal among all the disks d. Hence, 
to restart after a failure, p need only read its block from its own disk. In 
addition to the mbal, bal, and inp values mentioned above, p would also keep
the value of dblock[p] on its disk. 

We can now compare this algorithm with classic Paxos's Synod
protocol [10]. The mbal, bal, and inp components of dblock[p] are just lastTried[p],
nextBal[p], and prevVote[p] of the Synod Protocol. Phase 1 of the Disk
Synod algorithm corresponds to sending the NextBallot message and
receiving the LastVote responses in the Synod Protocol. Phase 2 corresponds to
sending the BeginBallot and receiving the Voted replies.⁴ The Synod
Protocol's Success message corresponds to the optimization mentioned above
of recording on disk that a value has been committed.

⁴ In the Synod Protocol, a processor q does not bother sending a response if p sends
it a disk block with a value of mbal smaller than one already on disk. Sending back the
maximum mbal value is an optimization mentioned in [10].

This version of the Disk Synod algorithm differs from the Synod Protocol 
in two ways. First, the Synod Protocol's NextBallot message contains only 
the mbal value; it does not contain bal and inp values. To obtain the Synod 
Protocol, we would have to modify the Disk Synod algorithm so that, in 
phase 1, it writes only the mbal field of its disk block and leaves the bal and
inp fields unchanged. The algorithm remains correct, with essentially the 
same proof, under this modification. However, the modification makes the 
algorithm harder to implement with real disks. 

The second difference between this version of the Disk Synod algorithm
and the Synod Protocol is in the restart procedure. A disk contains only
the aforementioned mbal, bal, and inp values. It does not contain a
separate copy of its owner's dblock value. The Synod Protocol can be obtained
from the following variant of the Disk Synod algorithm. Let bk be the block
disk[d][p] with maximum bal field read by processor p in the restart
procedure. Processor p can begin phase 1 with bal and inp values obtained from
any disk block bk', written by any processor, such that bk'.bal > bk.bal.
It can be shown that the Disk Synod algorithm remains correct under this
modification too.

4 Conclusion 

4.1 Implementation Considerations 

Implicit in our description of the Disk Synod algorithm are certain
assumptions about how reading and writing are implemented when disks are
accessed over a network. If operations sent to the disks may be lost, a processor
p must receive an acknowledgment from disk d that its write to disk[d][p]
succeeded. This may require p to explicitly read its disk block after writing
it. If operations may arrive at the disk in a different order than they were 
sent, p will have to wait for the acknowledgment that its write to disk d 
succeeded before reading other processors' blocks from d. Moreover, some 
mechanism is needed to ensure that a write from an earlier ballot does not 
arrive after a write from a later one, overwriting the later value with the 
earlier one. How this is achieved will be system dependent. (It is impossible 
to implement any fault-tolerant system if writes to disk can linger arbitrarily 
long in the network and cause later values to be overwritten.) 

Recall that, in Disk Paxos, a sequence of instances of the Disk Synod 
algorithm is used to commit a sequence of commands. In a straightforward 
implementation of Disk Paxos, processor p would write to its disk blocks the 
value of dblock[p] for the current instance of Disk Synod, plus the sequence
of all commands that have already been committed. The sequence of all 
commands that have ever been committed is probably too large to fit on a 
single disk block. However, the complete sequence can be stored on multiple 
disk blocks. All that must be kept in the same disk block as dblock[p] is a 
pointer to the head of the queue. For most applications, it is not necessary 
to remember the entire sequence of commands [10, Section 3.3.2]. In many 
cases, all the data that must be kept will fit in a single disk block. 

In the application for which Disk Paxos was devised (a future Compaq 
product), the set of processors is not known in advance. Each disk contains 
a directory listing the processors and the locations of their disk blocks. 
Before reading a disk, a processor reads the disk's directory. To write a 
disk's directory, a processor must acquire a lock for that disk by executing 
a real-time mutual exclusion algorithm based on Fischer's protocol [7]. A 
processor joins the system by adding itself to the directory on a majority of 
disks. 

4.2 Concluding Remarks 

We have presented Disk Paxos, an efficient implementation of the state 
machine approach in a system in which processors communicate by accessing 
ordinary (nonprogrammable) disks. In the normal case, the leader commits 
a command by writing its own block and reading every other processor's 
block on a majority of the shared disks. This is clearly the minimal number 
of disk accesses needed. 

Disk Paxos was motivated by the recent development of the Storage Area 
Network (SAN) — an architecture consisting of a network of computers and 
disks in which all disks can be accessed by each computer. Commodity disks 
are cheaper than computers, so using redundant disks for fault tolerance is 
more economical than using redundant computers. Moreover, since disks 
do not run application-level programs, they are less likely to crash than 
computers. 

Because commodity disks are not programmable, we could not simply 
substitute disks for processors in the classic Paxos algorithm. Instead we 
took the ideas of classic Paxos and transplanted them to the SAN environ- 
ment. What we obtained is almost, but not quite, a generalization of classic 
Paxos. Indeed, when Disk Paxos is instantiated to a single disk, we obtain 
what may be called Shared-Memory Paxos. Algorithms for shared memory
are usually more succinct and clear than their message passing counter- 
parts. Thus, Disk Paxos can be considered yet another revisiting of classic 
Paxos that exposes its underlying ideas by removing the message-passing
clutter. Perhaps other distributed algorithms can also be made more clear 
by recasting them in a shared-memory setting. 
Appendix

We now give a precise specification of the consensus problem solved by
the Disk Synod algorithm and of the algorithm itself. The specification is
written in TLA+ [11], a formal language that combines the temporal logic of
actions (TLA) [8], set theory, and first-order logic with notation for making
definitions and encapsulating them in modules. In the course of writing the
specifications, we try to explain any TLA+ notation whose meaning is not
self-evident. These specifications have been debugged with the aid of the
TLC model checker [17].⁵

We prove only consistency of the algorithm. We feel that the nonblocking 
property is sufficiently obvious not to need a formal proof. We therefore do 
not specify or reason about liveness properties. This means that we make 
hardly any use of temporal logic. 

A.1 The Specification of Consensus

We now formally specify the consensus problem. We assume N processors,
numbered 1 through N. Each processor p has two registers: an input register
input[p] that initially equals some element of a set Inputs of possible input
values, and an output register output[p] that initially equals a special value
NotAnInput that is not an element of Inputs. Processor p chooses an output
value by setting output[p]. It can also fail, which it does by setting input[p]
to any value in Inputs and resetting output[p] to NotAnInput. The precise
condition to be satisfied is that, if some processor p ever sets output[p] to
some value v, then

• v must be a value that is, or at one time was, the value of input[q] for 
some processor q 

• if any processor r (including p itself) later sets output[r] to some value
w other than NotAnInput, then w = v.

We specify only safety. There is no liveness requirement, so the specification 
is satisfied if no processor ever changes output[p].

TLA+ specifications are organized into modules. The specification of
consensus is in a module named SynodSpec, which begins: 

MODULE SynodSpec

EXTENDS Naturals

The EXTENDS statement imports the Naturals module, which defines the set
Nat of natural numbers and the usual arithmetic operations. It also defines
i..j to be the set of natural numbers from i through j. We next declare
the specification's two constants: the number N of processors, and the set
Inputs of inputs; and we assert the assumption that N is a positive natural
number.

⁵ The typeset versions were generated manually from the actual TLA+ specifications
by a procedure that may have introduced errors.

CONSTANT N, Inputs 
ASSUME (N ∈ Nat) ∧ (N > 0)

In TLA+, every value is a set, so we don't have to assert that Inputs is a
set. We next define two constants: the set Proc of processors, and the value
NotAnInput. In TLA+, ≜ means is defined to equal, and CHOOSE x : F(x)
equals an arbitrary value x such that F(x) is true (if such an x exists).

Proc ≜ 1..N

NotAnInput ≜ CHOOSE c : c ∉ Inputs

We next declare the variables input and output. 

VARIABLES input, output 

To write the specification, we introduce two internal variables: allInput,
which equals the set of all current and past values of input[p], for all
processors p; and chosen, which records the first input value output by some
processor (and hence, the value that all processors must henceforth output).
These variables are internal or "hidden" variables. In TLA, such variables
are bound variables of the temporal existential quantifier ∃. Since
internal variables aren't part of the specification, they should not be declared
in module SynodSpec. One way to introduce such variables in TLA+ is to
declare them in a submodule. So, we introduce a submodule called Inner.

MODULE Inner

VARIABLES allInput, chosen

Before going further, we explain some TLA+ notation. In programming
languages, the variables input and output would be arrays indexed by the
set Proc. What programmers call an array indexed by S, mathematicians call
a function with domain S. TLA+ uses the notation [x ∈ S ↦ e(x)] for the
function f with domain S such that f[x] = e(x) for all x in S. It denotes by
[S → T] the set of all functions f with domain S such that f[x] ∈ T for all
x ∈ S. TLA+ allows a conjunction or disjunction to be written as a list of
formulas bulleted by ∧ or ∨. Indentation is used to eliminate parentheses.
We now define IInit to be the predicate describing the initial state.
IInit ≜ ∧ input ∈ [Proc → Inputs]
        ∧ output = [p ∈ Proc ↦ NotAnInput]
        ∧ chosen = NotAnInput
        ∧ allInput = {input[p] : p ∈ Proc}

We next define the two actions, Choose(p) and Fail(p), that describe the 
operations that a processor p can perform. In TLA, an action is a formula 
with primed and unprimed variables that describes the relation between the 
values of the variables in a new (primed) state and their values in an old 
(unprimed) state. For example, in a system with the two variables x and y, 
the action (x' = x + 1) ∧ (y' = y) corresponds to the programming-language
statement x := x + 1. A conjunct with no primed variables is an enabling
condition. 

In TLA+, the expression [f EXCEPT ![x] = e] represents the function g
that is the same as f except that g[x] = e. Thus, f' = [f EXCEPT ![c] = e]
corresponds to the programming-language statement f[c] := e, except that
it says nothing about variables other than f. An action must explicitly state
what remains unchanged. We do this with the expression UNCHANGED v,
which means v' = v. Leaving a tuple ⟨v_1, ..., v_n⟩ unchanged is equivalent
to leaving all its components v_i unchanged.

The Choose(p) action represents the processor p choosing its output.
It is enabled iff output[p] equals NotAnInput. If chosen is NotAnInput,
then chosen and output[p] are set to any element of allInput. Otherwise,
output[p] is set to chosen.

Choose(p) ≜
  ∧ output[p] = NotAnInput
  ∧ IF chosen = NotAnInput
      THEN ∃ ip ∈ allInput : ∧ chosen' = ip
                             ∧ output' = [output EXCEPT ![p] = ip]
      ELSE ∧ output' = [output EXCEPT ![p] = chosen]
           ∧ UNCHANGED chosen
  ∧ UNCHANGED ⟨input, allInput⟩

The Fail(p) action represents processor p failing. It is always enabled. It
sets output[p] to NotAnInput, sets input[p] to any element of Inputs, and
adds that element to the set allInput.

Fail(p) ≜
  ∧ output' = [output EXCEPT ![p] = NotAnInput]
  ∧ ∃ ip ∈ Inputs : ∧ input' = [input EXCEPT ![p] = ip]
                    ∧ allInput' = allInput ∪ {ip}
  ∧ UNCHANGED chosen
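
For readers more used to programs than to actions, the Choose and Fail actions can be mimicked by ordinary state updates. The Python sketch below is only an analogy under stated assumptions: the nondeterministic choice "∃ ip" is resolved by an argument supplied by the caller, None stands in for NotAnInput, the set Inputs is not modeled, and the names SpecState, init, choose, and fail are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

NOT_AN_INPUT = None  # stand-in for NotAnInput


@dataclass
class SpecState:
    input: Dict[int, str]
    output: Dict[int, Optional[str]]
    chosen: Optional[str]
    all_input: Set[str]


def init(inputs: Dict[int, str]) -> SpecState:
    """Build a state satisfying the initial predicate."""
    return SpecState(
        input=dict(inputs),
        output={p: NOT_AN_INPUT for p in inputs},
        chosen=NOT_AN_INPUT,
        all_input=set(inputs.values()),
    )


def choose(s: SpecState, p: int, ip: str) -> bool:
    """Perform Choose(p); returns False if the action is not enabled."""
    if s.output[p] is not NOT_AN_INPUT:
        return False                      # enabling condition fails
    if s.chosen is NOT_AN_INPUT:
        if ip not in s.all_input:
            raise ValueError("ip must be an element of allInput")
        s.chosen = ip                     # first output fixes chosen
    s.output[p] = s.chosen                # later outputs must equal chosen
    return True


def fail(s: SpecState, p: int, ip: str) -> None:
    """Perform Fail(p) with new input ip; chosen is left unchanged."""
    s.output[p] = NOT_AN_INPUT
    s.input[p] = ip
    s.all_input.add(ip)
```

Once some processor has output a value, every later call to choose outputs that same value, whatever argument the caller supplies.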

We next define the next-state action INext, which describes all possible
steps. We then define ISpec, the specification with the internal variables
chosen and allInput visible. It asserts that the initial state satisfies IInit,
and every step either satisfies INext or leaves all the variables unchanged.
Formula ISpec is defined to be a temporal formula, using the ordinary
operator □ (always) of temporal logic, and the TLA notation that [N]_v equals
N ∨ (v' = v). These definitions end the submodule.

INext ≜ ∃ p ∈ Proc : Choose(p) ∨ Fail(p)

ISpec ≜ IInit ∧ □[INext]_⟨input, output, chosen, allInput⟩



Finally, we define SynodSpec, the complete specification, to be ISpec with
the variables chosen and allInput hidden, that is, quantified with the
temporal existential quantifier ∃ of TLA. The precise meaning of the TLA+
constructs used here is unimportant.

IS (chosen, alllnput) = instance Inner 

SynodSpec = 3 chosen, alllnput : IS (chosen, alllnput) ! ISpec 



This ends module SynodSpec. 

A.2 The Disk Synod Algorithm 

The Disk Synod algorithm is specified by a module DiskSynod that imports 
all the declarations and definitions from the SynodSpec module. 

MODULE DiskSynod 

EXTENDS SynodSpec 

The algorithm assumes that different processors use different ballot numbers. 
Instead of fixing some specific assignment of ballot numbers, we let 
Ballot(p) represent the set of ballot numbers that processor p can use, where 
Ballot is an unspecified constant operator. 

We have described the algorithm in terms of a majority of disks. The 
property of majorities we need is that any two majorities have a disk in 
common. If there is an even number d of disks, we can maintain that property 
even if we consider certain sets containing d/2 disks to constitute a majority. 
We let IsMajority be an unspecified predicate such that if IsMajority(S) and 
IsMajority(T) are true for two sets S and T of disks, then S and T are not 
disjoint. 

The module now declares Ballot, IsMajority, and the constant Disk that 
represents the set of disks. It also asserts the assumptions we make about 
them. In TLA+, the expression SUBSET S denotes the set of all subsets of 
the set S. 

CONSTANTS Ballot(_), Disk, IsMajority(_) 

ASSUME ∧ ∀ p ∈ Proc : ∧ Ballot(p) ⊆ {n ∈ Nat : n > 0} 
                      ∧ ∀ q ∈ Proc \ {p} : Ballot(p) ∩ Ballot(q) = {} 
       ∧ ∀ S, T ∈ SUBSET Disk : 
            IsMajority(S) ∧ IsMajority(T) ⇒ (S ∩ T ≠ {}) 
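The remark about an even number of disks can be checked by brute force on a small case. The sketch below assumes four disks named 0 to 3 and one possible (hypothetical) choice of IsMajority: every set of more than d/2 disks, plus those sets of exactly d/2 disks that contain disk 0. The function names are illustrative.

```python
from itertools import combinations

DISKS = frozenset({0, 1, 2, 3})  # hypothetical set of d = 4 disks


def is_majority(s):
    """Strict majorities, plus the d/2-element sets that contain disk 0."""
    s = frozenset(s)
    strict = 2 * len(s) > len(DISKS)
    half_with_tiebreak = 2 * len(s) == len(DISKS) and 0 in s
    return strict or half_with_tiebreak


def all_subsets(universe):
    """Every subset of the universe, as frozensets (the analogue of SUBSET)."""
    items = sorted(universe)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def majorities_intersect():
    """Check the assumed property: any two majority sets share a disk."""
    majs = [s for s in all_subsets(DISKS) if is_majority(s)]
    return all(s & t for s in majs for t in majs)
```

Under this choice {0, 1} counts as a majority while {2, 3} does not, and the intersection property still holds.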

We next define two constants: the set DiskBlock of all possible records that 
a processor can write to its disk blocks, and the record InitDB that is the 
initial value of all disk blocks. In TLA+, [f_1 ↦ v_1, ..., f_n ↦ v_n] is the 
record r with fields f_1, ..., f_n such that r.f_i = v_i for all i in 1..n, and 
[f_1 : S_1, ..., f_n : S_n] is the set of all such records with v_i an element of 
the set S_i, for all i in 1..n. The set ⋃S, the union of all the elements of 
S, is written UNION S. For example, UNION {A, B, C} equals A ∪ B ∪ C. 

DiskBlock ≜ [mbal : (UNION {Ballot(p) : p ∈ Proc}) ∪ {0}, 
             bal  : (UNION {Ballot(p) : p ∈ Proc}) ∪ {0}, 
             inp  : Inputs ∪ {NotAnInput}] 

InitDB ≜ [mbal ↦ 0, bal ↦ 0, inp ↦ NotAnInput] 

We now declare all the specification's variables, except for input and output, 
whose declarations are imported from SynodSpec. We have described the 
variables disk (the contents of the disks) and dblock above. We let phase[p] 
be the current phase of processor p, which will be set to 0 when p fails and 
to 3 when p chooses its output. For convenience, we let each processor start 
in phase 0 and begin the algorithm as if it were recovering from a failure. 
The variables disksWritten and blocksRead record a processor's progress 
in the current phase; disksWritten[p] is the set of disks that processor p 
has written, and blocksRead[p][d] is the set of values p has read from disk 
d. More precisely, blocksRead[p][d] is a set of records with block and proc 
fields, where [block ↦ bk, proc ↦ q] is in blocksRead[p][d] iff p has read the 
value bk from disk[d][q] in the current phase. For convenience, we declare 
vars to be the tuple of all the specification's variables. We also define the 
predicate Init that defines the initial values of all variables. 

VARIABLES disk, dblock, phase, disksWritten, blocksRead 

vars ≜ ⟨input, output, disk, phase, dblock, disksWritten, blocksRead⟩ 

Init ≜ ∧ input ∈ [Proc → Inputs] 
       ∧ output = [p ∈ Proc ↦ NotAnInput] 
       ∧ disk = [d ∈ Disk ↦ [p ∈ Proc ↦ InitDB]] 
       ∧ phase = [p ∈ Proc ↦ 0] 
       ∧ dblock = [p ∈ Proc ↦ InitDB] 
       ∧ output = [p ∈ Proc ↦ NotAnInput] 
       ∧ disksWritten = [p ∈ Proc ↦ {}] 
       ∧ blocksRead = [p ∈ Proc ↦ [d ∈ Disk ↦ {}]] 

We now define two operators that describe the state of a processor during 
the current phase: hasRead(p, d, q) is true iff p has read disk[d][q] during 
the current phase, and allBlocksRead(p) equals the set of all disk[d][q] values 
that p has read during the current phase. The TLA+ expression LET def IN exp 
equals expression exp in the context of the local definitions in def. 

hasRead(p, d, q) ≜ ∃ br ∈ blocksRead[p][d] : br.proc = q 

allBlocksRead(p) ≜ 
  LET allRdBlks ≜ UNION {blocksRead[p][d] : d ∈ Disk} 
  IN  {br.block : br ∈ allRdBlks} 

We now define InitializePhase(p) to be an action that sets disksWritten[p] 
and blocksRead[p] to their initial values, to indicate that p has done no 
reading or writing yet in the current phase. This action will be used to 
define other actions that make up the next-state relation; it itself is not part 
of the next-state relation. 

InitializePhase(p) ≜ 
  ∧ disksWritten' = [disksWritten EXCEPT ![p] = {}] 
  ∧ blocksRead' = [blocksRead EXCEPT ![p] = [d ∈ Disk ↦ {}]] 

We now define the actions that will form part of the next-state action. These 
actions describe all the atomic actions of the algorithm that a processor p 
can perform. The first is StartBallot(p), in which p initiates a new ballot. 
We allow p to do this at any time during phase 1 or 2. The action sets phase[p] 
to 1, increases dblock[p].mbal, and initializes the phase. 

StartBallot(p) ≜ 
  ∧ phase[p] ∈ {1, 2} 
  ∧ phase' = [phase EXCEPT ![p] = 1] 
  ∧ ∃ b ∈ Ballot(p) : ∧ b > dblock[p].mbal 
                      ∧ dblock' = [dblock EXCEPT ![p].mbal = b] 
  ∧ InitializePhase(p) 
  ∧ UNCHANGED ⟨input, output, disk⟩ 

In action Phase1or2Write(p, d), processor p writes disk[d][p] and adds d to 
the set disksWritten[p] of disks written by p. The action is enabled iff p is in 
phase 1 or 2.⁶ In the TLA+ expression [f EXCEPT ![c] = e], an @ appearing 
in e stands for f[c]. Thus, x' = [x EXCEPT ![c] = @ + 1] corresponds to the 
programming-language statement x[c] := x[c] + 1. 

Phase1or2Write(p, d) ≜ 
  ∧ phase[p] ∈ {1, 2} 
  ∧ disk' = [disk EXCEPT ![d][p] = dblock[p]] 
  ∧ disksWritten' = [disksWritten EXCEPT ![p] = @ ∪ {d}] 
  ∧ UNCHANGED ⟨input, output, phase, dblock, blocksRead⟩ 
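For readers more used to programming languages, the EXCEPT construct with @ behaves like a functional update: it describes a new function and leaves the old one untouched. A rough Python analogue, with the helper name except_ and the sample data chosen purely for illustration, might look like this.

```python
def except_(f, key, update):
    """Return a copy of mapping f in which f[key] is replaced by update(f[key]).

    The argument passed to `update` plays the role of TLA+'s @.
    """
    g = dict(f)
    g[key] = update(f[key])
    return g


disks_written = {"p": frozenset(), "q": frozenset({"d2"})}

# Analogue of: disksWritten' = [disksWritten EXCEPT ![p] = @ ∪ {d}], with d = "d1"
disks_written_next = except_(disks_written, "p", lambda old: old | {"d1"})
```

The original mapping plays the part of the unprimed variable and the returned copy plays the part of the primed one.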

Action Phase1or2Read(p, d, q) describes p reading disk[d][q]. It is enabled 
iff d is in disksWritten[p], meaning that p has already written its block to 
disk d. (This implies that p is in phase 1 or 2.) We allow p to reread a 
disk block it has already read. If disk[d][q].mbal is less than p's current 
mbal value, then blocksRead[p][d] is updated and p continues executing its 
ballot. Otherwise, p aborts the ballot and begins a new one. The EXCEPT 
construct has a more general form for "arrays of arrays". For example, 
the formula x' = [x EXCEPT ![a][b] = e] corresponds to the 
programming-language statement x[a][b] := e. 

Phase1or2Read(p, d, q) ≜ 
  ∧ d ∈ disksWritten[p] 
  ∧ IF disk[d][q].mbal < dblock[p].mbal 
      THEN ∧ blocksRead' = 
               [blocksRead EXCEPT 
                  ![p][d] = @ ∪ {[block ↦ disk[d][q], proc ↦ q]}] 
           ∧ UNCHANGED ⟨input, output, disk, phase, dblock, disksWritten⟩ 
      ELSE StartBallot(p) 
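The decision made by this read action can be sketched in Python as follows. This is an informal illustration under several assumptions: blocks are dictionaries with an mbal key, blocksRead[p] is modelled as a dictionary of lists rather than sets, and the function only reports that the ballot must be aborted instead of performing StartBallot(p). All names are hypothetical.

```python
def phase1or2_read(dblock_p, disks_written_p, blocks_read_p, disk, d, q):
    """Return 'not enabled', 'recorded', or 'abort' for p reading disk[d][q].

    dblock_p        -- p's current block, a dict with an 'mbal' key
    disks_written_p -- set of disks p has written in the current phase
    blocks_read_p   -- dict: disk -> list of (block, proc) pairs read so far
    disk            -- dict: disk -> dict: proc -> block
    """
    if d not in disks_written_p:
        return "not enabled"
    block = disk[d][q]
    if block["mbal"] < dblock_p["mbal"]:
        seen = blocks_read_p.setdefault(d, [])
        entry = (block, q)
        if entry not in seen:  # rereading a block adds nothing new
            seen.append(entry)
        return "recorded"
    return "abort"  # the specification performs StartBallot(p) in this case


disk = {"d1": {"q": {"mbal": 3, "bal": 0, "inp": None}}}
read_so_far = {}
r1 = phase1or2_read({"mbal": 5}, {"d1"}, read_so_far, disk, "d1", "q")
r2 = phase1or2_read({"mbal": 2}, {"d1"}, {}, disk, "d1", "q")
r3 = phase1or2_read({"mbal": 5}, set(), {}, disk, "d1", "q")
```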

⁶ We could add the enabling condition d ∉ disksWritten[p], but it's not necessary 
because the action is a no-op, leaving all variables unchanged, if p has already written its 
current value of dblock[p] to disk d. 

The action EndPhase1or2(p) describes processor p successfully finishing 
phase 1 or 2. It is enabled when p is in phase 1 or 2 and, on a majority 
of the disks, p has written its block and read every other processor's block. 
When p finishes phase 1, it sets dblock[p].inp and dblock[p].bal as described 
in Section 3.1 and starts phase 2. When p finishes phase 2, it sets output[p], 
sets phase[p] to 3, and terminates. (However, it could still fail and start 
again.) The TLA+ EXCEPT construct applies to records as well as functions, 
and it can have multiple "replacements" separated by commas. 

EndPhase1or2(p) ≜ 
  ∧ IsMajority({d ∈ disksWritten[p] : 
                  ∀ q ∈ Proc \ {p} : hasRead(p, d, q)}) 
  ∧ ∨ ∧ phase[p] = 1 
      ∧ dblock' = 
          [dblock EXCEPT 
             ![p].bal = dblock[p].mbal, 
             ![p].inp = 
               LET blocksSeen ≜ allBlocksRead(p) ∪ {dblock[p]} 
                   nonInitBlks ≜ 
                     {bs ∈ blocksSeen : bs.inp ≠ NotAnInput} 
                   maxBlk ≜ 
                     CHOOSE b ∈ nonInitBlks : 
                       ∀ c ∈ nonInitBlks : b.bal ≥ c.bal 
               IN  IF nonInitBlks = {} THEN input[p] 
                                       ELSE maxBlk.inp] 
      ∧ UNCHANGED output 
    ∨ ∧ phase[p] = 2 
      ∧ output' = [output EXCEPT ![p] = dblock[p].inp] 
      ∧ UNCHANGED dblock 
  ∧ phase' = [phase EXCEPT ![p] = @ + 1] 
  ∧ InitializePhase(p) 
  ∧ UNCHANGED ⟨input, disk⟩ 
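The rule for setting dblock[p].inp at the end of phase 1 is the core of the action above. A minimal Python sketch of just that rule follows; it assumes blocks are dictionaries with bal and inp keys and uses None for NotAnInput. Where the specification's CHOOSE may pick any block of maximal bal, this sketch takes the first one that max encounters.

```python
NOT_AN_INPUT = None  # stands in for NotAnInput


def phase1_input(blocks_seen, own_input):
    """Value for dblock[p].inp when p finishes phase 1.

    blocks_seen -- the blocks p has read plus p's own block; each is a dict
                   with 'bal' and 'inp' keys
    own_input   -- input[p], used only if no block carries a value yet
    """
    non_init = [b for b in blocks_seen if b["inp"] is not NOT_AN_INPUT]
    if not non_init:
        return own_input
    return max(non_init, key=lambda b: b["bal"])["inp"]


seen = [
    {"bal": 0, "inp": NOT_AN_INPUT},
    {"bal": 4, "inp": "x"},
    {"bal": 7, "inp": "y"},
]
```

With the sample blocks, the value of the block with bal 7 wins; if every block seen is still in its initial state, p falls back to its own input.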

Action Fail(p) represents a failure by processor p. The action is always 
enabled. It chooses a new value of input[p], sets phase[p] to 0, and initializes 
dblock[p], output[p], disksWritten[p], and blocksRead[p]. 

Fail(p) ≜ 
  ∧ ∃ ip ∈ Inputs : input' = [input EXCEPT ![p] = ip] 
  ∧ phase' = [phase EXCEPT ![p] = 0] 
  ∧ dblock' = [dblock EXCEPT ![p] = InitDB] 
  ∧ output' = [output EXCEPT ![p] = NotAnInput] 
  ∧ InitializePhase(p) 
  ∧ UNCHANGED disk 

The next two actions describe failure recovery. In Phase0Read(p, d), 
processor p reads disk[d][p], recording the value read in blocksRead[p]. Again, we 
allow redundant reads of the same disk block. In EndPhase0(p), processor 
p completes its recovery and enters phase 1, as described in Section 3.1. 

Phase0Read(p, d) ≜ 
  ∧ phase[p] = 0 
  ∧ blocksRead' = [blocksRead EXCEPT 
                     ![p][d] = @ ∪ {[block ↦ disk[d][p], proc ↦ p]}] 
  ∧ UNCHANGED ⟨input, output, disk, phase, dblock, disksWritten⟩ 

EndPhase0(p) ≜ 
  ∧ phase[p] = 0 
  ∧ IsMajority({d ∈ Disk : hasRead(p, d, p)}) 
  ∧ ∃ b ∈ Ballot(p) : 
      ∧ ∀ r ∈ allBlocksRead(p) : b > r.mbal 
      ∧ dblock' = [dblock EXCEPT 
                     ![p] = [(CHOOSE r ∈ allBlocksRead(p) : 
                                ∀ s ∈ allBlocksRead(p) : r.bal ≥ s.bal) 
                             EXCEPT !.mbal = b]] 
  ∧ InitializePhase(p) 
  ∧ phase' = [phase EXCEPT ![p] = 1] 
  ∧ UNCHANGED ⟨input, output, disk⟩ 
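The block that p reconstructs in EndPhase0(p) can be sketched in Python as below. The sketch assumes the caller has already checked the IsMajority enabling condition, models Ballot(p) by a finite sample of numbers, and picks the smallest admissible ballot number where the specification allows any. The function and parameter names are hypothetical.

```python
def end_phase0(blocks_read, ballots):
    """Return p's recovered block, or None if no sampled ballot is large enough.

    blocks_read -- nonempty list of p's own blocks read back from the disks,
                   each a dict with 'mbal', 'bal', and 'inp' keys
    ballots     -- finite sample of Ballot(p), the numbers p may use
    """
    highest_mbal = max(b["mbal"] for b in blocks_read)
    candidates = [n for n in ballots if n > highest_mbal]
    if not candidates:
        return None
    base = max(blocks_read, key=lambda b: b["bal"])  # block with largest bal
    recovered = dict(base)
    recovered["mbal"] = min(candidates)  # any larger ballot would also do
    return recovered


on_disk = [
    {"mbal": 5, "bal": 3, "inp": "x"},
    {"mbal": 8, "bal": 5, "inp": "y"},
]
```

The recovered block keeps the bal and inp of the most advanced copy found, while its mbal exceeds every mbal that p wrote before failing.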

As in most TLA specifications, we define the next-state action Next that 
describes all possible steps of all processors. We then define the formula 
DiskSynodSpec, our specification of the algorithm, to assert that the ini- 
tial state satisfies Init and every step either satisfies Next or leaves all the 
variables unchanged. 

Next ≜ ∃ p ∈ Proc : 
         ∨ StartBallot(p) 
         ∨ ∃ d ∈ Disk : ∨ Phase0Read(p, d) 
                        ∨ Phase1or2Write(p, d) 
                        ∨ ∃ q ∈ Proc \ {p} : Phase1or2Read(p, d, q) 
         ∨ EndPhase1or2(p) 
         ∨ Fail(p) 
         ∨ EndPhase0(p) 

DiskSynodSpec ≜ Init ∧ □[Next]_vars 

The module ends by asserting the correctness of the algorithm, which means 
that the algorithm's specification implies the formula SynodSpec that is its 
correctness condition. 

THEOREM DiskSynodSpec ⇒ SynodSpec 



A.3 An Assertional Proof 

We now sketch a proof of the correctness of the Disk Synod algorithm, 
that is, a proof that DiskSynodSpec implies SynodSpec. Formula SynodSpec 
equals ∃ chosen, allInput : ISpec.⁷ To prove such a formula, we must find 
Skolem functions with which to instantiate the bound variables chosen and 
allInput, and then prove that DiskSynodSpec implies ISpec, when chosen and 
allInput are defined to equal those Skolem functions. The choice of Skolem 
functions is called a refinement mapping. However, we cannot define such 
a refinement mapping because chosen and allInput record history that is 
not present in the actual state of the algorithm. Instead, we add chosen 
and allInput to the algorithm specification as history variables. Formally, 
we define a specification HDiskSynodSpec such that 

DiskSynodSpec ≡ ∃ chosen, allInput : HDiskSynodSpec 

We then prove that HDiskSynodSpec implies ISpec, from which we infer by 
simple logic that DiskSynodSpec implies SynodSpec. 

The initial predicate HInit of HDiskSynodSpec is the conjunction of the 
initial predicate Init of DiskSynodSpec with formulas that specify the initial 
values of chosen and allInput. Its next-state action HNext is the conjunction 
of the next-state action Next of DiskSynodSpec with formulas that specify 
the values of chosen' and allInput' as functions of the (unprimed and primed) 
values of the other variables. A general theorem of TLA asserts that, if the 
variable x does not occur in I, N, or the tuple y of variables, then 

I ∧ □[N]_y ≡ ∃ x : (I ∧ (x = f(y))) ∧ □[N ∧ (x' = g(x, y, y'))]_⟨x, y⟩ 

for any f and g. This result implies that the specification obtained from 
HDiskSynodSpec by hiding (existentially quantifying) chosen and allInput 
is equivalent to DiskSynodSpec. 

⁷ Actually, ∃ chosen, allInput : ISpec is not a legal TLA+ formula; we should instead 
write ∃ chosen, allInput : IS(chosen, allInput)!ISpec. 

We define HDiskSynodSpec in a module HDiskSynod that extends the 
DiskSynod module and declares chosen and allInput as variables. 

MODULE HDiskSynod 

EXTENDS DiskSynod 
VARIABLES allInput, chosen 

The initial values of chosen and allInput are the same as in the initial 
predicate of ISpec. 

HInit ≜ ∧ Init 
        ∧ chosen = NotAnInput 
        ∧ allInput = {input[p] : p ∈ Proc} 

The action HNext ensures that chosen equals the first output value that is 
different from NotAnInput, and that allInput always equals the set of all 
input values that have appeared thus far. 

HNext ≜ 
  ∧ Next 
  ∧ chosen' = LET hasOutput(p) ≜ output'[p] ≠ NotAnInput 
              IN  IF ∨ chosen ≠ NotAnInput 
                     ∨ ∀ p ∈ Proc : ¬hasOutput(p) 
                    THEN chosen 
                    ELSE output'[CHOOSE p ∈ Proc : hasOutput(p)] 
  ∧ allInput' = allInput ∪ {input'[p] : p ∈ Proc} 
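The way HNext updates the two history variables can be mirrored by a small pure function. This is only a sketch: None stands for NotAnInput, processors are assumed to have sortable names, and the arbitrary CHOOSE is replaced by taking the first processor in sorted order that has an output.

```python
NOT_AN_INPUT = None  # stands in for NotAnInput


def next_history(chosen, all_input, output_next, input_next):
    """Compute (chosen', allInput') from the primed values of output and input.

    output_next -- dict: proc -> output value after the step
    input_next  -- dict: proc -> input value after the step
    """
    with_output = [p for p in sorted(output_next)
                   if output_next[p] is not NOT_AN_INPUT]
    if chosen is not NOT_AN_INPUT or not with_output:
        chosen_next = chosen  # already chosen, or still nothing to choose
    else:
        chosen_next = output_next[with_output[0]]  # CHOOSE: any such p
    return chosen_next, set(all_input) | set(input_next.values())


first = next_history(None, {"a"}, {"p": None, "q": "a"}, {"p": "b", "q": "a"})
later = next_history("a", {"a"}, {"p": "z", "q": None}, {"p": "a", "q": "a"})
```

Because the function reads only the old history values and the new state, it has the shape x' = g(x, y, y') required by the general theorem on adding history variables.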

The module then defines HDiskSynodSpec in the usual way, and asserts that 
it implies ISpec, with chosen and allInput replaced by the variables of the 
same name declared in the current module. (Again, the details of how this 
is expressed in TLA+ are not important.) 

HDiskSynodSpec ≜ HInit ∧ □[HNext]_⟨vars, chosen, allInput⟩ 

THEOREM HDiskSynodSpec ⇒ IS(chosen, allInput)!ISpec 

We now outline the proof of this theorem. Let ivars be the tuple of all 
variables of ISpec: 

ivars ≜ ⟨input, output, chosen, allInput⟩ 

To prove that HDiskSynodSpec implies ISpec we must prove⁸ 

THEOREM R1  HInit ⇒ IInit 

THEOREM R2  HInit ∧ □[HNext]_⟨vars, chosen, allInput⟩ ⇒ □[INext]_ivars 

The proof of R1 is trivial. To prove R2, standard TLA reasoning shows that 
it suffices to find a state predicate HInv for which we can prove: 

THEOREM R2a  HInit ∧ □[HNext]_⟨vars, chosen, allInput⟩ ⇒ □HInv 
THEOREM R2b  HInv ∧ HInv' ∧ HNext ⇒ INext ∨ (UNCHANGED ivars) 

A predicate HInv satisfying R2a is said to be an invariant of the specification 
HInit ∧ □[HNext]_⟨vars, chosen, allInput⟩. To prove R2a, we make HInv strong 
enough to satisfy: 

THEOREM I1  HInit ⇒ HInv 
THEOREM I2  HInv ∧ HNext ⇒ HInv' 

A predicate HInv satisfying I2 is said to be an invariant of the action HNext. 
A standard TLA theorem asserts that I1 and I2 imply R2a. 
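The difference between an invariant of the specification and an invariant of the action, and hence the need to make HInv "strong enough", can be seen on a toy system. Everything in the following sketch is hypothetical and unrelated to the Disk Synod state space: a counter modulo 6 that starts at 0 and advances by 2, checked by brute force over all six states.

```python
def is_inductive(states, init, next_states, inv):
    """Check I1 (init implies inv) and I2 (every step from inv preserves inv)."""
    i1 = all(inv(s) for s in states if init(s))
    i2 = all(inv(t) for s in states if inv(s) for t in next_states(s))
    return i1 and i2


# Toy system: a counter modulo 6 that starts at 0 and advances by 2,
# so the only reachable values are 0, 2, and 4.
STATES = range(6)


def init(s):
    return s == 0


def step(s):
    return [(s + 2) % 6]


def even(s):
    return s % 2 == 0


def not_five(s):
    return s != 5
```

Here not_five holds in every reachable state, yet it fails I2 because the unreachable state 3 satisfies it and steps to 5. The stronger predicate even satisfies both I1 and I2 and implies not_five, which is the pattern followed when strengthening HInv.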

There are two general approaches to defining HInv. In both, we write 
HInv as a conjunction HI_1 ∧ ... ∧ HI_k. In the bottom-up method, we define 
the HI_i in increasing order of i, so that each conjunction HI_1 ∧ ... ∧ HI_i is 
an invariant of HNext. We stop when we obtain an invariant strong enough 
to prove R2b. In the top-down method, we start by defining HI_k so that 
R2b is satisfied with HI_k substituted for HInv. We then define the HI_i in 
decreasing order of i so that HI_i ∧ ... ∧ HI_k ∧ HNext ⇒ HI'_{i+1}, stopping when 
we obtain an invariant of HNext. In practice, one uses a combination of the 
two methods, with a lot of backtracking. Here, we present the invariant in 
a bottom-up fashion. 

If the set of disks is empty, then IsMajority(D) is false for all subsets D of 
Disk. (This follows from the assumption about IsMajority by substituting D 
for both S and T.) Hence, HDiskSynodSpec implies that the system remains 
forever in its initial state, trivially satisfying ISpec. It therefore suffices to 
consider only the case when Disk is nonempty: 

ASSUME Disk ≠ {} 

⁸ The symbols IInit and INext are not defined in the current context; to be rigorous, 
we should define them to equal IS(chosen, allInput)!IInit and IS(chosen, allInput)!INext, 
respectively. 

The standard starting point for a TLA proof is a simple "type invariant", 
which we call HInv1, asserting that all variables have the correct type: 



HInv1 ≜ 
  ∧ input ∈ [Proc → Inputs] 
  ∧ output ∈ [Proc → Inputs ∪ {NotAnInput}] 
  ∧ disk ∈ [Disk → [Proc → DiskBlock]] 
  ∧ phase ∈ [Proc → 0..3] 
  ∧ dblock ∈ [Proc → DiskBlock] 
  ∧ output ∈ [Proc → Inputs ∪ {NotAnInput}] 
  ∧ disksWritten ∈ [Proc → SUBSET Disk] 
  ∧ blocksRead ∈ [Proc → [Disk → 
                     SUBSET [block : DiskBlock, proc : Proc]]] 
  ∧ allInput ∈ SUBSET Inputs 
  ∧ chosen ∈ Inputs ∪ {NotAnInput} 
  ∧ input ∈ [Proc → Inputs] 

Our first lemma asserts that HInv1 is an invariant of HNext: 

LEMMA I2a  HInv1 ∧ HNext ⇒ HInv1' 

The proofs of Theorem R2b and of most lemmas appear in Section A.4 
below. 

Before going any further, we define some useful state functions. First, 
we let MajoritySet be the set of all subsets of the set of disks containing 
a majority of them; we let blocksOf(p) be the set of all copies of p's disk 
blocks in the system, that is, dblock[p], p's blocks on disk, and all blocks 
of p read by some processor; and we let allBlocks be the set of all copies of 
all disk blocks of all processors. 

MajoritySet ≜ {D ∈ SUBSET Disk : IsMajority(D)} 

blocksOf(p) ≜ 
  LET rdBy(q, d) ≜ {br ∈ blocksRead[q][d] : br.proc = p} 
  IN  {dblock[p]} ∪ {disk[d][p] : d ∈ Disk} 
        ∪ {br.block : br ∈ UNION {rdBy(q, d) : q ∈ Proc, d ∈ Disk}} 

allBlocks ≜ UNION {blocksOf(p) : p ∈ Proc} 
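To make these state functions concrete, the sketch below computes them for a small hand-built state. It assumes blocks are represented as (mbal, bal, inp) tuples so that they can be set members, and that the state variables are nested dictionaries; the data and names are illustrative only.

```python
def blocks_of(p, dblock, disk, blocks_read):
    """All copies of p's block: in memory, on disk, and read by any processor.

    dblock      -- dict: proc -> block, where a block is (mbal, bal, inp)
    disk        -- dict: disk -> dict: proc -> block
    blocks_read -- dict: proc -> dict: disk -> set of (block, proc) pairs
    """
    copies = {dblock[p]}
    copies |= {disk[d][p] for d in disk}
    for q in blocks_read:
        for d in blocks_read[q]:
            copies |= {blk for (blk, owner) in blocks_read[q][d] if owner == p}
    return copies


def all_blocks(dblock, disk, blocks_read):
    """Union of blocks_of(p) over all processors p."""
    return set().union(*(blocks_of(p, dblock, disk, blocks_read)
                         for p in dblock))


dblock = {"p": (5, 3, "x"), "q": (0, 0, None)}
disk = {
    "d1": {"p": (5, 3, "x"), "q": (0, 0, None)},
    "d2": {"p": (2, 0, None), "q": (0, 0, None)},
}
blocks_read = {
    "p": {"d1": set(), "d2": set()},
    "q": {"d1": {((4, 3, "x"), "p")}, "d2": set()},
}
```

In this state, p has an up-to-date copy on d1, a stale copy on d2, and an older copy that q read earlier, so blocks_of("p", ...) contains three distinct blocks.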

The next conjunct of HInv describes some simple relations between the 
values of the different variables. 

HInv2 ≜ 
  ∧ ∀ p ∈ Proc : 
      ∀ bk ∈ blocksOf(p) : ∧ bk.mbal ∈ Ballot(p) ∪ {0} 
                           ∧ bk.bal ∈ Ballot(p) ∪ {0} 
                           ∧ (bk.bal = 0) ≡ (bk.inp = NotAnInput) 
                           ∧ bk.mbal ≥ bk.bal 
  ∧ ∀ p ∈ Proc, d ∈ Disk : 
      ∧ (d ∈ disksWritten[p]) ⇒ ∧ phase[p] ∈ {1, 2} 
                                ∧ disk[d][p] = dblock[p] 
      ∧ (phase[p] ∈ {1, 2}) ⇒ ∧ (blocksRead[p][d] ≠ {}) ⇒ 
                                   (d ∈ disksWritten[p]) 
                              ∧ ¬hasRead(p, d, p) 
  ∧ ∀ p ∈ Proc : 
      ∧ (phase[p] = 0) ⇒ ∧ dblock[p] = InitDB 
                         ∧ disksWritten[p] = {} 
                         ∧ ∀ d ∈ Disk : ∀ br ∈ blocksRead[p][d] : 
                              ∧ br.proc = p 
                              ∧ br.block = disk[d][p] 
      ∧ (phase[p] ≠ 0) ⇒ ∧ dblock[p].mbal ∈ Ballot(p) 
                         ∧ dblock[p].bal ∈ Ballot(p) ∪ {0} 
                         ∧ ∀ d ∈ Disk : ∀ br ∈ blocksRead[p][d] : 
                              br.block.mbal < dblock[p].mbal 
      ∧ (phase[p] ∈ {2, 3}) ⇒ (dblock[p].bal = dblock[p].mbal) 
      ∧ output[p] = IF phase[p] = 3 THEN dblock[p].inp ELSE NotAnInput 
  ∧ chosen ∈ allInput ∪ {NotAnInput} 
  ∧ ∀ p ∈ Proc : ∧ input[p] ∈ allInput 
                 ∧ (chosen = NotAnInput) ⇒ (output[p] = NotAnInput) 

The invariance of HInv1 ∧ HInv2 follows from Lemma I2a and: 

LEMMA I2b  HInv1 ∧ HInv2 ∧ HNext ⇒ HInv2' 

The next conjunct of HInv expresses the observation that if processors 
p and q have each read the other's block from disk d during their current 
phases, then at least one of them has read the other's current block. 

HInv3 ≜ ∀ p, q ∈ Proc, d ∈ Disk : 
          ∧ phase[p] ∈ {1, 2} 
          ∧ phase[q] ∈ {1, 2} 
          ∧ hasRead(p, d, q) 
          ∧ hasRead(q, d, p) 
          ⇒ ∨ [block ↦ dblock[q], proc ↦ q] ∈ blocksRead[p][d] 
            ∨ [block ↦ dblock[p], proc ↦ p] ∈ blocksRead[q][d] 

LEMMA I2c  HInv1 ∧ HInv3 ∧ HNext ⇒ HInv3' 



The next conjunct of the invariant expresses relations among the mbal and 
bal values of a processor and of its disk blocks. Its first conjunct asserts that, 
when p is not recovering from a failure, its mbal value is at least as large as 
the bal field of any of its blocks, and at least as large as the mbal field of 
its block on some disk in any majority set. Its second conjunct asserts that, 
in phase 1, its mbal value is actually greater than the bal field of any of its 
blocks. Its third conjunct asserts that, in phase 2, its bal value is the mbal 
field of all its blocks on some majority set of disks. The fourth conjunct 
asserts that the bal field of any of its blocks is at most as large as the mbal 
field of all its disk blocks on some majority set of disks. 

HInv4 ≜ 
  ∀ p ∈ Proc : 
    ∧ (phase[p] ≠ 0) ⇒ 
        ∧ ∀ bk ∈ blocksOf(p) : dblock[p].mbal ≥ bk.bal 
        ∧ ∀ D ∈ MajoritySet : 
            ∃ d ∈ D : ∧ dblock[p].mbal ≥ disk[d][p].mbal 
                      ∧ dblock[p].bal ≥ disk[d][p].bal 
    ∧ (phase[p] = 1) ⇒ (∀ bk ∈ blocksOf(p) : dblock[p].mbal > bk.bal) 
    ∧ (phase[p] ∈ {2, 3}) ⇒ 
        (∃ D ∈ MajoritySet : ∀ d ∈ D : disk[d][p].mbal = dblock[p].bal) 
    ∧ ∀ bk ∈ blocksOf(p) : 
        ∃ D ∈ MajoritySet : ∀ d ∈ D : disk[d][p].mbal ≥ bk.bal 

LEMMA I2d  HInv1 ∧ HInv2 ∧ HInv2' ∧ HInv4 ∧ HNext ⇒ HInv4' 

Before going further, we define maxBalInp(b, v) to assert that every 
block in allBlocks with bal field at least b has inp field v. 

maxBalInp(b, v) ≜ ∀ bk ∈ allBlocks : (bk.bal ≥ b) ⇒ (bk.inp = v) 
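As a quick illustration of this predicate, the following Python sketch evaluates it over an explicit list of blocks. The dictionary representation of blocks and the sample values are assumptions made for the example.

```python
def max_bal_inp(all_blocks, b, v):
    """True iff every block with bal >= b has inp == v."""
    return all(blk["inp"] == v for blk in all_blocks if blk["bal"] >= b)


blocks = [
    {"mbal": 9, "bal": 7, "inp": "x"},
    {"mbal": 7, "bal": 7, "inp": "x"},
    {"mbal": 4, "bal": 3, "inp": "y"},
]

holds_from_5 = max_bal_inp(blocks, 5, "x")   # only the bal-7 blocks count
fails_from_3 = max_bal_inp(blocks, 3, "x")   # the bal-3 block carries "y"
vacuous_at_8 = max_bal_inp(blocks, 8, "z")   # no block has bal >= 8
```

The last call shows that the predicate holds vacuously for any v when no block has a bal field of at least b.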

We now come to a conjunct of HInv that provides some high-level insight 
into why the algorithm is correct. It asserts that, if a processor p is in 
phase 2, then either its bal and inp values satisfy maxBalInp, or else p must 
eventually abort its current ballot. Processor p will eventually abort its 
ballot if there is some processor q and majority set D such that p has not 
read q's block on any disk in D, and all of those blocks have mbal values 
greater than dblock[p].bal. (Since p must read at least one of those disks, it 
must eventually read one of those blocks and abort.) 

HInv5 ≜ 
  ∀ p ∈ Proc : 
    (phase[p] = 2) ⇒ ∨ maxBalInp(dblock[p].bal, dblock[p].inp) 
                     ∨ ∃ D ∈ MajoritySet, q ∈ Proc : 
                         ∀ d ∈ D : ∧ disk[d][q].mbal > dblock[p].bal 
                                   ∧ ¬hasRead(p, d, q) 

LEMMA I2e 
  HInv1 ∧ HInv2 ∧ HInv2' ∧ HInv3 ∧ HInv4 ∧ HInv5 ∧ HNext ⇒ HInv5' 

Before defining our final conjunct, we define a predicate valueChosen(v) 
that is true if v is the only possible value that can be chosen as an output. 
It asserts that there is some ballot number b such that maxBalInp(b, v) is 
true. This condition is satisfied if there is no block bk in allBlocks with 
bk.bal ≥ b. So, valueChosen(v) must require that some processor p has 
written blocks with bal field at least b to a majority set D of the disks. (By 
maxBalInp(b, v), those blocks must have inp field v.) We also ensure that, 
once valueChosen(v) becomes true, it can never be made false. This requires 
the additional condition that no processor q that is currently executing 
phase 1 with mbal value at least b can fail to see those blocks that p has 
written. So, valueChosen(v) also asserts that, for every disk d in D, if q has 
already read disk[d][p], then it has read a block with bal field at least b. 

valueChosen(v) ≜ 
  ∃ b ∈ UNION {Ballot(p) : p ∈ Proc} : 
    ∧ maxBalInp(b, v) 
    ∧ ∃ p ∈ Proc, D ∈ MajoritySet : 
        ∀ d ∈ D : ∧ disk[d][p].bal ≥ b 
                  ∧ ∀ q ∈ Proc : 
                      ∧ phase[q] = 1 
                      ∧ dblock[q].mbal ≥ b 
                      ∧ hasRead(q, d, p) 
                      ⇒ (∃ br ∈ blocksRead[q][d] : br.bal ≥ b) 

It's obvious that, if valueChosen(v) and valueChosen(w) both hold, then v = w. 

The final conjunct of HInv asserts that, once an output has been 
chosen, valueChosen(chosen) holds, and each processor's output equals either 
chosen or NotAnInput. 

HInv6 ≜ ∧ (chosen ≠ NotAnInput) ⇒ valueChosen(chosen) 
        ∧ ∀ p ∈ Proc : output[p] ∈ {chosen, NotAnInput} 

LEMMA I2f  HInv1 ∧ HInv2 ∧ HInv2' ∧ HInv3 ∧ HInv6 ∧ HNext ⇒ HInv6' 

We define HInv to be the conjunction of HInv1 through HInv6. 

HInv ≜ HInv1 ∧ HInv2 ∧ HInv3 ∧ HInv4 ∧ HInv5 ∧ HInv6 

Theorem I2 then follows easily from Lemmas I2a through I2f. 

A.4 Proofs 

We now sketch the proofs of most of the lemmas from Section A.3 and of 
Theorem R2b. We give hierarchically structured proofs [9]. A structured 
proof consists of a sequence of statements and their proofs; each of those 
proofs is either a structured proof or an ordinary paragraph-style proof. The 
jth step in the current level-i proof is numbered (i)j. Within a 
paragraph-style proof, (i)j denotes the most recent statement with that number. The 
proof statement "(i)j. Q.E.D." denotes the current goal, that is, the level 
i − 1 statement being proved by this step. A proof statement 

Assume: A 
Prove: P 

asserts that the assumption A implies P. If P is the current goal, then this 
is abbreviated as 

Case: A 

An assumption CONSTANT c ∈ S asserts that c is a new constant parameter 
that we assume is in S. We prove ∀ c ∈ S : P(c) by proving 

Assume: CONSTANT c ∈ S 
Prove: P(c) 

The assumption CONSTANT c ∈ S S.T. A(c) also assumes that c satisfies 
A(c). A proof statement 

CHOOSE c ∈ S S.T. P(c) 

asserts the existence of a value c in S satisfying P(c), and defines c to be 
such a value. To prove this statement, we must demonstrate the existence 
of c. 

We recommend that proofs be read hierarchically, from the top level 
down. To read the proof of a long level-i step, you should first read the 
level-(i + 1) statements that form its proof, together with the proof of the 
final "Q.E.D." step (which is usually a short paragraph), and then read the 
proof of each level-(i + 1) step, in any desired order. 

We also use a hierarchical scheme for naming subformulas of a formula. 
If F is the name of a formula that is a conjunction, then F.i is the name of its 
ith conjunct. A similar scheme is used for a disjunction, except using letters 
instead of numbers, so F.c is the name of the third disjunct of F. If F is the 
name of the formula P ⇒ Q, then F.L is the name of P and F.R is the name 
of Q. If F is the name of the formula ∃ x : P(x) or ∀ x : P(x), then F(e) 
is the name of the formula P(e), for any expression e. This is generalized 
in the obvious way for abbreviated quantifications like ∃ x, y : P(x, y). For 
example, HInv5(n).R.b(E, m)(dd).2 is the formula ¬hasRead(n, dd, m). 

We now give the proofs. We omit the proofs of Lemmas I2a and I2b, 
which require a simple but tedious case analysis for the different disjuncts 
of Next. In the informal paragraph-style proofs, we use HInv1 implicitly in 
many places by tacitly assuming that variables have values of the right type. 
For example, we deduce phase'[p] = 2 from 

phase' = [phase EXCEPT ![p] = 2] 

without mentioning that this follows only if phase is a function whose domain 
contains p, which is implied by HInv1. 

A.4.1 Lemma I2c 

We prove Lemma I2c by proving: 

Assume: 1. HInv1 ∧ HInv3 ∧ HNext 
        2. CONSTANTS p, q ∈ Proc, d ∈ Disk 
        3. HInv3(p, q, d).L' 
Prove:  HInv3(p, q, d).R' 

(1)1. Case: ¬HInv3(p, q, d).L 
  (2)1. Case: Phase1or2Read(p, d, q) 
    Proof: Action Phase1or2Read(p, d, q) adds the record 
      [block ↦ dblock[q], proc ↦ q] 
    to blocksRead[p][d], making HInv3(p, q, d).R.a' true. 
  (2)2. Case: Phase1or2Read(q, d, p) 
    Proof: Action Phase1or2Read(q, d, p) adds the record 
      [block ↦ dblock[p], proc ↦ p] 
    to blocksRead[q][d], making HInv3(p, q, d).R.b' true. 
  (2)3. Case: EndPhase0(p) 
    Proof: This implies ¬hasRead(p, d, q), so HInv3(p, q, d).L' is false, 
    making HInv3(p, q, d)' true. 
  (2)4. Case: EndPhase0(q) 
    Proof: This implies ¬hasRead(q, d, p), so HInv3(p, q, d).L' is false, 
    making HInv3(p, q, d)' true. 
  (2)5. Q.E.D. 
    Proof: By assumption 3 and the level (1) case assumption, one of the 
    four conjuncts of HInv3(p, q, d).L is changed from false to true. Steps 
    (2)1 through (2)4 cover the four subactions of Next that can make one of 
    those conjuncts true. 

(1)2. Case: HInv3(p, q, d).L 
  Proof: By HInv3 (which holds by assumption 1), the case assumption 
  implies HInv3(p, q, d).R. The only subactions of HNext that make 
  HInv3(p, q, d).R false are ones that remove elements from blocksRead[p][d] 
  or blocksRead[q][d] or that change dblock[p] or dblock[q]. The only such 
  subactions are ones with an InitializePhase(p) or InitializePhase(q) 
  conjunct, which make HInv3(p, q, d).L' false, contrary to assumption 3. 

(1)3. Q.E.D. 
  Proof: Immediate from steps (1)1 and (1)2. 

A.4.2 Lemma BksOf 

We now state and prove a simple result that will be used below. 

LEMMA BksOf 
  Next ∧ HInv1 ⇒ 
    (∀ p ∈ Proc : 
       blocksOf(p)' ⊆ (blocksOf(p) \ {dblock[p]}) ∪ {dblock'[p]}) 

It is proved as follows. 

Assume: p ∈ Proc 
Prove:  blocksOf(p)' ⊆ (blocksOf(p) \ {dblock[p]}) ∪ {dblock'[p]} 

Proof: The only way an HNext step creates a new block for p, rather than 
copying an existing one, is by changing dblock[p]. 

A.4.3 Lemma I2d 

Assume: 1. HInv1 ∧ HInv2 ∧ HInv2' ∧ HInv4 ∧ HNext 
        2. CONSTANT p ∈ Proc 
Prove:  HInv4(p)' 

(1)1. HInv4(p).1' 
  (2)1. Case: (phase[p] = 0) ∧ (phase'[p] ≠ 0) 
    (3)1. EndPhase0(p) 
      Proof: By the level (2) case assumption, since EndPhase0(p) is the 
      only subaction of HNext that changes phase[p] from zero to a nonzero 
      value. 
    (3)2. Assume: CONSTANT bk ∈ blocksOf(p)' S.T. bk ≠ dblock'[p] 
          Prove:  dblock'[p].mbal ≥ bk.bal 
      (4)1. bk ∈ blocksOf(p) 
        Proof: Lemma BksOf and the level (3) assumption. 
      (4)2. CHOOSE D1 ∈ MajoritySet S.T. 
              ∀ d ∈ D1 : disk[d][p].mbal ≥ bk.bal 
        Proof: HInv4(p).4 and (4)1 imply the existence of D1. 
      (4)3. ∀ D ∈ MajoritySet : ∃ d ∈ D : disk[d][p].mbal ≥ bk.bal 
        Proof: Using (4)2, for any majority set D we can choose d to be 
        a disk in D1 ∩ D, which is nonempty because any two majority sets 
        have an element in common. 
      (4)4. ∃ d ∈ Disk : ∃ rb ∈ blocksRead[p][d] : rb.block.mbal ≥ bk.bal 
        Proof: By HInv2.3(p).1.R.3, which holds by assumption 1 and 
        case assumption (2), hasRead(p, d, p) implies that blocksRead[p][d] 
        consists of a single element whose block field equals disk[d][p], for 
        any disk d. Step (3)1 implies that hasRead(p, d, p) is true for all 
        d in some majority set D of disks. The level (3) goal then follows 
        from (4)3. 
      (4)5. Q.E.D. 
        Proof: (4)4 and (3)1 imply dblock'[p].mbal ≥ bk.bal. 
    (3)3. HInv4(p).1.R.2' 
      (4)1. ∃ D ∈ MajoritySet : 
              ∀ d ∈ D : ∧ dblock'[p].mbal ≥ disk[d][p].mbal 
                        ∧ dblock'[p].bal ≥ disk[d][p].bal 
        Proof: By (3)1, dblock'[p].mbal ≥ br.mbal and dblock'[p].bal ≥ 
        br.bal for all br ∈ allBlocksRead(p). By (3)1, the level (2) case 
        assumption, and HInv2.3(p).1, allBlocksRead(p) contains all blocks 
        disk[d][p] for d in some majority set D of disks. 
      (4)2. Q.E.D. 
        Proof: HInv4(p).1.R.2' follows from (4)1 and (3)1, which implies 
        that disk is unchanged. 
    (3)4. Q.E.D. 
      Proof: By (3)2 and (3)3, since (3)2 implies HInv4(p).1.R.1(bk)' 
      except for the case bk = dblock'[p]; and HInv4(p).1.R.1(bk)' follows 
      from (3)1 and HInv2.1(p).4 in that case. 
  (2)2. Case: (phase[p] ≠ 0) ∧ (phase'[p] ≠ 0) 
    (3)1. ∧ dblock'[p].mbal ≥ dblock[p].mbal 
          ∧ dblock'[p].bal ≥ dblock[p].bal 
      Proof: The only subactions of Next that change dblock[p] are 
        StartBallot(p), EndPhase1or2(p), EndPhase0(p), and Fail(p). 
      Of these, only Fail(p) can decrease dblock[p].mbal or dblock[p].bal. 
      (HInv2.1(p).4 implies that EndPhase1or2(p) ∧ phase[p] = 1 cannot 
      decrease dblock[p].bal.) The level (2) case assumption implies 
      ¬Fail(p). 
    (3)2. HInv4(p).1.R.1' 
      Proof: If bk ∈ blocksOf(p), then HInv4(p).1.R.1(bk)' follows from 
      (3)1 and HInv4(p).1.R.1 (which holds by assumption 1 and the level 
      (2) case assumption). If bk = dblock'[p], then HInv4(p).1.R.1(bk)' 
      follows from HInv2.1(p)(bk).4'. We then obtain HInv4(p).1.R.1' from 
      Lemma BksOf. 
    (3)3. HInv4(p).1.R.2' 
      Proof: HNext implies that disk'[d][p] equals disk[d][p] or dblock[p], 
      so HInv4(p).1.R.2' follows from (3)1 and HInv4(p).1.R.1, which holds 
      by assumption 1 and the level (2) case assumption. 
    (3)4. Q.E.D. 
      Proof: By (3)2 and (3)3. 
  (2)3. Q.E.D. 
    Proof: By (2)1 and (2)2, since HInv4(p).1' is trivially true if phase'[p] 
    equals 0. 
(1)2. HInv4(p).2' 

(2)1. Case: (phase[p] / 1) A (phase' [p] = 1) 
(3)1. Case: phase[p] =0 

(4) 1. EndPhaseO(p) 

Proof: By the levels (2) and (3) case assumptions. 
(4)2. ∀ bk ∈ blocksOf(p) :
          ∃ D ∈ MajoritySet : ∀ d ∈ D : disk[d][p].mbal ≥ bk.bal
Proof: By HInv4(p).4.
(4)3. ∀ bk ∈ blocksOf(p) :
          ∀ D ∈ MajoritySet : ∃ d ∈ D : disk[d][p].mbal ≥ bk.bal
Proof: By (4)2, since any two majority sets have a disk in common.

(4)4. ∀ bk ∈ blocksOf(p) :
          ∃ br ∈ allBlocksRead(p) : br.mbal ≥ bk.bal
Proof: By (4)1, (4)3, and HInv2.3(p).1.R.3 (which holds by
assumption 1 and the level (3) case assumption).

(4)5. Q.E.D.
Proof: (4)1 implies

    ∀ br ∈ allBlocksRead(p) : dblock'[p].mbal > br.mbal
Therefore, HInv4(p).2.R(bk)' follows from (4)4 and (4)1 if bk ∈
blocksOf(p). Step (4)1 also implies

    ∃ bk ∈ blocksOf(p) : dblock'[p].bal = bk.bal
so HInv4(p).2.R(bk)' follows from (4)4 if bk = dblock'[p]. By
Lemma BksOf, this proves HInv4(p).2.R'.
(3)2. Case: phase[p] ∈ {2,3}

(4)1. ∀ bk ∈ blocksOf(p) : dblock[p].mbal ≥ bk.bal

Proof: HInv4(p).1 and the level (3) case assumption (which imply
HInv4(p).1.R.1).
(4)2. ∧ dblock'[p].mbal > dblock[p].mbal
      ∧ dblock'[p].bal = dblock[p].bal
Proof: By HNext and the level (2) and (3) case assumptions,
which imply StartBallot(p).
(4)3. Q.E.D.

Proof: (4)1 and (4)2 imply HInv4(p).2.R(bk)' for bk = dblock'[p]
and bk ∈ blocksOf(p). Lemma BksOf then implies HInv4(p).2.R'.
(3)3. Q.E.D. 

Proof: The level (2) case assumption implies that (3)1 and (3)2 
cover all possibilities. 
(2)2. Case: (phase[p] = 1) ∧ (phase'[p] = 1)

Proof: By HNext, this implies dblock'[p] = dblock[p], so Lemma BksOf 
implies that HInv4(p).2' follows from HInv4(p).2. 
(2)3. Q.E.D. 

Proof: Since HInv4(p).2' is trivially true if phase'[p] ≠ 1, the cases of
(2)1 and (2)2 are exhaustive.
(1)3. HInv4(p).3'

(2)1. Case: (phase[p] ≠ 2) ∧ (phase'[p] = 2)

(3)1. EndPhase1or2(p) ∧ (phase[p] = 1)

Proof: By HNext and the level (2) case assumption.
(3)2. ∃ D ∈ MajoritySet : ∀ d ∈ D : disk[d][p].mbal = dblock[p].mbal

Proof: By (3)1 and HInv2.2(p).1.
(3)3. Q.E.D.

Proof: (3)1 implies dblock'[p].bal = dblock[p].mbal and disk' = disk,
which by (3)2 implies HInv4(p).3'.
(2)2. Case: (phase[p] ∈ {2,3}) ∧ (phase'[p] ∈ {2,3})
(3)1. dblock'[p].bal = dblock[p].bal 

Proof: By HNext and the level (2) case assumption. 






(3)2. ∀ d ∈ Disk :
          Phase1or2Write(p, d) ⇒ (disk'[d][p].mbal = dblock[p].bal)
Proof: By the level (2) case assumption and HInv2.3(p).3.
(3)3. Q.E.D.

Proof: HInv4(p).3' follows from HInv4(p).3, (3)1, and (3)2, since
HNext ∧ ¬Phase1or2Write(p, d) implies disk'[d][p] = disk[d][p], for
any disk d.
(2)3. Q.E.D. 

Proof: HInv4(p).3' follows from (2)1 and (2)2 because it is trivially
true if phase'[p] ∉ {2, 3}, and HNext ∧ (phase'[p] = 3) implies phase[p] ∈
{2,3}.
(1)4. HInv4(p).4'

(2)1. Case: EndPhase1or2(p) ∧ (phase[p] = 1)

(3)1. ∃ D ∈ MajoritySet : ∀ d ∈ D : disk'[d][p].mbal ≥ dblock'[p].bal

Proof: By (1)3 and the level (2) case assumption, which implies
phase'[p] = 2.
(3)2. disk' = disk 

Proof: By the level (2) case assumption. 
(3)3. Q.E.D. 

Proof: If bk ≠ dblock'[p], then HInv4(p).4(bk)' follows from (3)2
and HInv4(p).4(bk). If bk = dblock'[p], then it follows from (3)1.
(2)2. Case: Fail(p)

Proof: Fail(p) implies disk' = disk, so HInv4(p).4(bk)' follows from
HInv4(p).4(bk) if bk ≠ dblock'[p]. For bk = dblock'[p], HInv4(p).4(bk)'
is trivial because Fail(p) implies dblock'[p].bal = 0.
(2)3. Case: ∃ d ∈ Disk : Phase1or2Write(p, d)
Proof: In this case, we have

    ∃ d ∈ Disk : disk' = [disk EXCEPT ![d][p] = dblock[p]]
and phase'[p] ≠ 0. From this, HInv4(p).4, and HInv4(p).1.R.1' (which
holds by (1)1), we deduce HInv4(p).4(bk)' for bk ∈ blocksOf(p) ∩
blocksOf(p)'. For bk = dblock'[p], we obtain HInv4(p).4(bk)' from
HInv2.1(p)(bk).4'. By Lemma BksOf, this proves HInv4(p).4'.
(2)4. Q.E.D. 

Proof: By (2)1, (2)2, (2)3, since they consider all the subactions of
HNext that change bVal or disk[d][p], for some disk d.
(1)5. Q.E.D. 

Proof: By steps (1)1-(1)4. 
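
Several steps above rely on the fact that any two majority sets have a disk in common, and use it to pass from "there is a majority set on which a condition holds at every disk" to "every majority set contains a disk at which it holds". The following is a minimal Python sketch of that reasoning on a toy configuration. It assumes that a majority set is any set containing more than half of the disks; the proof itself needs only the intersection property. The disk names, the predicate, and all function names are hypothetical.

```python
from itertools import combinations


def majority_sets(disks):
    """All subsets of `disks` holding strictly more than half of them.

    Assumption: "majority" means more than half; the proof only needs
    the weaker property that any two majority sets intersect.
    """
    ordered = sorted(disks)
    n = len(ordered)
    return [frozenset(c)
            for k in range(n // 2 + 1, n + 1)
            for c in combinations(ordered, k)]


def pairwise_intersecting(sets):
    """True if every two sets in `sets` share at least one element."""
    return all(a & b for a in sets for b in sets)


def forall_exists_from_exists_forall(sets, pred):
    """If pred holds everywhere on SOME set, it holds somewhere on EVERY set."""
    if not any(all(pred(d) for d in s) for s in sets):
        return True  # hypothesis fails, so the implication holds vacuously
    return all(any(pred(d) for d in s) for s in sets)


disks = {"d1", "d2", "d3", "d4", "d5"}   # hypothetical disk names
ms = majority_sets(disks)
written = {"d1", "d2", "d3"}             # disks where the condition holds
print(pairwise_intersecting(ms))
print(forall_exists_from_exists_forall(ms, lambda d: d in written))
```

An exhaustive check over five disks is an illustration, not a proof; it only makes the quantifier swap concrete.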
A.4.4 Lemma I2e

Simple logic shows that, to prove Lemma I2e, it suffices to prove:

Assume: 1. HInv1 ∧ HInv2 ∧ HInv2' ∧ HInv3 ∧ HInv4 ∧ HInv5 ∧ HNext

        2. CONSTANT p ∈ Proc

        3. phase'[p] = 2

        4. ¬HInv5(p).R.a'
Prove: HInv5(p).R.b'

(1)1. Case: phase[p] ≠ 2

(2)1. EndPhase1or2(p) ∧ (phase[p] = 1)

Proof: By HNext, assumption 3, and the level (1) case assumption. 

(2)2. CHOOSE bk ∈ allBlocks S.T.
          (bk.bal ≥ dblock'[p].bal) ∧ (bk ≠ dblock'[p])
Proof: Assumption 4 and the definition of maxBalInp imply that there
exists bk ∈ allBlocks' such that bk.bal ≥ dblock'[p].bal and bk.inp ≠
dblock'[p].inp. By Lemma BksOf and the definition of allBlocks, (2)1
implies bk ∈ allBlocks.

(2)3. CHOOSE q ∈ Proc \ {p} S.T. bk ∈ blocksOf(q)

Proof: By (2)2 and the definition of allBlocks, there is some processor
q such that bk ∈ blocksOf(q). (2)1 and (2)2 imply bk.bal ≥
dblock[p].mbal, so (2)1 and HInv4(p).2 imply q ≠ p.

(2)4. ∃ D ∈ MajoritySet : ∀ d ∈ D : disk[d][q].mbal ≥ dblock'[p].bal
Proof: By (2)3, HInv4(q).4, and (2)2.

(2)5. ∃ D ∈ MajoritySet : ∀ d ∈ D : disk[d][q].mbal > dblock'[p].bal
Proof: By (2)3 (which implies p ≠ q) and (2)4, since (2)1 (which
implies dblock'[p].bal > 0), HInv2.1, and the assumption that different
processors have distinct ballot numbers imply disk[d][q].mbal ≠
dblock'[p].bal.

(2)6. Q.E.D. 

Proof: (2)1 implies ¬hasRead(p, d, q)', for all disks d. Hence, (2)5
implies HInv5(p).R.b'.
(1)2. Case: (phase[p] = 2) ∧ HInv5(p).R.a

(2)1. CHOOSE q ∈ Proc \ {p} S.T. ∧ EndPhase1or2(q) ∧ (phase[q] = 1)
                                 ∧ dblock'[q].bal > dblock[p].bal
                                 ∧ dblock'[q].inp ≠ dblock[p].inp
Proof: HNext, Assumption 3, and the level (1) case assumption imply
that dblock'[p] = dblock[p]. Assumption 4, the level (1) case assumption,
the definition of maxBalInp, and Lemma BksOf imply
    ∧ dblock'[q].bal ≥ dblock[p].bal
    ∧ dblock'[q].inp ≠ dblock[p].inp
    ∧ (dblock'[q].bal ≠ dblock[q].bal) ∨ (dblock'[q].bal > dblock[q].bal)
for some processor q ≠ p. By HNext, this implies EndPhase1or2(q) ∧
(phase[q] = 1). By HInv2.1 and the assumption that different processors
have different ballot numbers, this also implies dblock'[q].bal ≠
dblock[p].bal.

(2)2. CHOOSE D ∈ MajoritySet S.T.
          ∀ d ∈ D : ∧ disk[d][q].mbal > dblock[p].bal
                    ∧ hasRead(q, d, p)
Proof: By HInv2.2(q, d).1 and (2)1, there is a majority set D such that
hasRead(q, d, p) and disk[d][q].mbal = dblock'[q].bal, for all d ∈ D. The
result then follows from (2)1.

(2)3. ∀ d ∈ D : [block ↦ dblock[p], proc ↦ p] ∉ blocksRead[q][d]

Proof: By the level (1) case assumption, (2)1, and the definitions of
maxBalInp and EndPhase1or2, if dblock[p] were in allBlocksRead(q),
then dblock'[q].inp would equal dblock[p].inp, contradicting (2)1.

(2)4. ∀ d ∈ D : ¬∃ br ∈ blocksRead[p][d] : br.block.mbal > dblock[p].bal

Proof: By the level (1) case assumption (which implies phase[p] = 2),
HInv2.3(p).2.R.3, and HInv2.3(p).3.

(2)5. ∀ d ∈ D : ¬hasRead(p, d, q)

Proof: By HInv3, (2)2 (which implies hasRead(q, d, p) for d ∈ D), the
level (1) case assumption (which implies phase[p] = 2), and (2)1 (which
implies phase[q] = 1), hasRead(p, d, q) implies

    dblock[q] ∈ allBlocksRead(p)
which is impossible by (2)2 and (2)4.

(2)6. Q.E.D. 

Proof: Since (2)1 implies that disk, dblock[p].bal and hasRead(p, d, q)
are unchanged, for all d ∈ Disk, (2)2 and (2)5 imply HInv5(p).R.b'.
(1)3. Case: (phase[p] = 2) ∧ HInv5(p).R.b
(2)1. CHOOSE D ∈ MajoritySet, q ∈ Proc S.T.
          (q ≠ p) ∧ HInv5(p).R.b(D, q)
Proof: The level (1) case assumption implies the existence of D and q
satisfying HInv5(p).R.b(D, q). Since any two majority sets have a disk
in common, HInv4(p).3 then implies q ≠ p.
(2)2. Case: ∃ d ∈ D : Phase1or2Write(q, d)

(3)1. ∃ d ∈ D : disk' = [disk EXCEPT ![d][q] = dblock[q]]

Proof: By the level (2) case assumption.
(3)2. dblock[q].mbal > dblock[p].bal
Proof: By (2)1 and HInv4(q).1.R.2.
(3)3. Q.E.D. 

Proof: (3)1, (3)2, and (2)1 imply HInv5(p).R.b(D, q)'.

(2)3. Case: ∃ d ∈ D : Phase1or2Read(p, d, q)

Proof: In this case, HInv5(p).R.b (the level (1) case assumption) implies
phase'[p] = 1 (because the ballot must abort), contradicting
assumption 3.

(2)4. Q.E.D.

Proof: Assumption 3, the level (1) case assumption, and HNext imply
that dblock[p] is unchanged; and HNext implies that, for any d ∈ D:
    ∧ (disk'[d][q] ≠ disk[d][q]) ⇒ Phase1or2Write(q, d)
    ∧ hasRead(p, d, q)' ∧ ¬hasRead(p, d, q) ⇒ Phase1or2Read(p, d, q)
Hence, (2)2 and (2)3 cover the only cases in which HInv5(p).R.b(D, q)
can be made false. In all other cases, HInv5(p).R.b' follows from (2)1.
(1)4. Q.E.D. 

Proof: By HInv5(p), the cases in steps (1)1, (1)2, and (1)3 are exhaus- 
tive. 
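
Step (1)1 of this proof passes from the failure of maxBalInp to a witness block bk. A minimal Python sketch of that reading follows. It assumes maxBalInp(b, v) says that every block whose bal is at least b has input v, with a non-strict comparison; this is inferred from how the proof uses the definition, which is stated elsewhere in the paper. The Block record, the toy values, and the function names are hypothetical.

```python
from typing import NamedTuple, Optional


class Block(NamedTuple):
    """Hypothetical stand-in for a disk block with fields mbal, bal, inp."""
    mbal: int
    bal: int
    inp: Optional[str]


def max_bal_inp(all_blocks, b, v):
    """Assumed reading: every block with bal >= b carries input v."""
    return all(bk.inp == v for bk in all_blocks if bk.bal >= b)


def witness_against(all_blocks, b, v):
    """Return a block with bal >= b and inp != v, or None if there is none.

    This is the block bk that the proof chooses when maxBalInp fails.
    """
    for bk in all_blocks:
        if bk.bal >= b and bk.inp != v:
            return bk
    return None


blocks = [Block(3, 3, "x"), Block(5, 2, "y"), Block(7, 7, "x")]
print(max_bal_inp(blocks, 3, "x"))      # only blocks with bal >= 3 count
print(witness_against(blocks, 2, "x"))  # the block with bal 2 and input "y"
```

The two functions are duals: max_bal_inp returns False exactly when witness_against returns a block.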

A.4.5 Lemma I2f 

The proof of Lemma I2f uses: 

LEMMA VC  ∀ v ∈ Inputs : HInv1 ∧ HInv4 ∧ HNext ∧ valueChosen(v)
              ⇒ valueChosen(v)'

We prove Lemma VC by proving:

Assume: 1. CONSTANT b ∈ UNION {Ballot(p) : p ∈ Proc}

        2. CONSTANTS v ∈ Inputs, p ∈ Proc, D ∈ MajoritySet

        3. maxBalInp(b, v)

        4. valueChosen(v)(b).2(p, D)

Prove: maxBalInp(b, v)' ∧ valueChosen(v)(b).2(p, D)'

(1)1. maxBalInp(b,v)' 

(2)1. Case: ∃ q ∈ Proc : EndPhase1or2(q) ∧ (phase[q] = 1)
(3)1. CHOOSE q ∈ Proc S.T. EndPhase1or2(q) ∧ (phase[q] = 1)

Proof: q exists by the level (2) case assumption.
(3)2. Case: dblock[q].mbal ≥ b
(4)1. ∃ d ∈ D : hasRead(q, d, p)

Proof: By (3)1 (which implies hasRead(q, d, p) for all d in some
majority set), since any two majority sets have a disk in common.
(4)2. ∃ d ∈ D : ∃ br ∈ blocksRead[q][d] : br.block.bal ≥ b
Proof: By (4)1, the level (3) case assumption, and assumption 4.
(4)3. dblock'[q].inp = v 

Proof: By (4)2, maxBalInp(b, v) (assumption 3), (3)1, and the
definition of EndPhase1or2.
(4)4. Q.E.D.

Proof: (4)3, maxBalInp(b, v) (assumption 3), (3)1 (which implies
blocksOf(r)' = blocksOf(r) for r ≠ q), and Lemma BksOf imply
maxBalInp(b, v)'.
(3)3. Case: dblock[q].mbal < b

Proof: By (3)1, this implies dblock'[q].bal < b, so maxBalInp(b, v)
(assumption 3), (3)1, and Lemma BksOf imply maxBalInp(b, v)'.
(3)4. Q.E.D. 

Proof: By (3)2 and (3)3. 
(2)2. Q.E.D. 

Proof: By (2)1, since HNext ∧ (allBlocks' ≠ allBlocks) implies
    ∃ q ∈ Proc : ∨ EndPhase1or2(q) ∧ (phase[q] = 1)
                 ∨ Fail(q)

and maxBalInp(b, v) ∧ Fail(q) obviously implies maxBalInp(b, v)'.
(1)2. valueChosen(v)(b).2(p, D)' 
(2)1. Assume: CONSTANT d ∈ D
      Prove: disk'[d][p].bal ≥ b
(3)1. Case: Phase1or2Write(p, d)

(4)1. ∃ dd ∈ D : dblock[p].bal ≥ disk[dd][p].bal

Proof: By HInv4(p).1.R.2 (which holds because the level (3) case
assumption implies phase[p] ≠ 0) and assumption 2.
(4)2. dblock[p].bal ≥ b

Proof: By (4)1 and assumption 4, which implies disk[dd][p].bal ≥
b for all dd ∈ D.
(4)3. Q.E.D.

Proof: By the level (3) case assumption, disk'[d][p] = dblock[p],
so (4)2 implies disk'[d][p].bal ≥ b.
(3)2. Case: disk'[d][p] = disk[d][p]

Proof: In this case, assumption 4 and the level (2) assumption imply
disk'[d][p].bal ≥ b.
(3)3. Q.E.D.

Proof: By (3)1 and (3)2, since:

    HNext ∧ (disk'[d][p] ≠ disk[d][p]) ⇒ Phase1or2Write(p, d)
(2)2. Assume: 1. CONSTANTS q ∈ Proc, d ∈ D

              2. phase'[q] = 1

              3. dblock'[q].mbal ≥ b

              4. hasRead(q, d, p)'
      Prove: ∃ br ∈ blocksRead'[q][d] : br.block.bal ≥ b
(3)1. phase[q] = 1 

Proof: By the level (2) assumptions 2 and 4, since:

    HNext ∧ (phase'[q] ≠ phase[q]) ⇒ InitializePhase(q)
and InitializePhase(q) implies ¬hasRead(q, d, p)'.
(3)2. dblock'[q].mbal = dblock[q].mbal

Proof: By the level (2) assumption 4, since:

    HNext ∧ (dblock'[q] ≠ dblock[q]) ⇒ InitializePhase(q)
and InitializePhase(q) implies ¬hasRead(q, d, p)'.
(3)3. Case: Phase1or2Read(q, d, p)

Proof: Assumption 4 and the level (2) assumption 2 imply
disk[d][p].bal ≥ b. The case assumption and the level (2) assumption 4
imply

    [block ↦ disk[d][p], proc ↦ p] ∈ blocksRead'[q][d]

proving the level (2) goal.
(3)4. Case: ¬Phase1or2Read(q, d, p)
(4)1. hasRead(q, d, p)

Proof: By the level (3) case assumption and the level (2) assumption 4,
since:

    HNext ∧ ¬hasRead(q, d, p) ∧ hasRead(q, d, p)'
        ⇒ Phase1or2Read(q, d, p)
(4)2. ∃ br ∈ blocksRead[q][d] : br.block.bal ≥ b

Proof: (3)1, (3)2 and the level (2) assumption 3 (which imply
dblock[q].mbal ≥ b), and assumption 4.
(4)3. Q.E.D.

Proof: By (4)2 and the level (2) assumption 4, since:
    HNext ∧ hasRead(q, d, p)' ⇒
        (blocksRead[q][d] ⊆ blocksRead'[q][d])
(3)5. Q.E.D. 

Proof: By (3)3 and (3)4. 
(2)3. Q.E.D. 

Proof: (2)1 and (2)2 imply valueChosen(v)(b).2(p, D)' . 
(1)3. Q.E.D. 

Proof: By (1)1 and (1)2. 

We now prove Lemma I2f by proving: 

Assume: HInv1 ∧ HInv2 ∧ HInv2' ∧ HInv3 ∧ HInv5 ∧ HInv6 ∧ HNext
Prove: HInv6'

(1)1. Assume: chosen' ≠ NotAnInput
      Prove: valueChosen(chosen)'
(2)1. Case: chosen = NotAnInput

(3)1. CHOOSE p ∈ Proc S.T. EndPhase1or2(p) ∧ (phase[p] = 2)

Proof: HInv2.5, HInv2.5', and the level (1) and (2) assumptions
imply the existence of a p ∈ Proc such that:

    (output[p] = NotAnInput) ∧ (output'[p] ≠ NotAnInput)
By HNext, this implies EndPhase1or2(p) ∧ (phase[p] = 2).
(3)2. maxBalInp(dblock[p].bal, dblock[p].inp)
Proof: (3)1 implies

    ∃ D ∈ MajoritySet : ∀ d ∈ D, q ∈ Proc : hasRead(p, d, q)
Since any two majority sets have a disk in common, this implies
¬HInv5(p).R.b. Hence, HInv5 and (3)1 (which implies phase[p] = 2)
imply HInv5(p).R.a.
(3)3. maxBalInp(dblock[p].bal, chosen)'
Proof: (3)1 implies

    (chosen' = dblock[p].inp) ∧ (dblock'[p].bal = dblock[p].bal)
which by (3)2 implies maxBalInp(dblock'[p].bal, chosen'). Lemma
BksOf and (3)1 imply maxBalInp(b, v)' = maxBalInp(b, v) for any
constants b and v.
(3)4. CHOOSE D ∈ MajoritySet S.T.
          ∀ d ∈ D, q ∈ Proc : hasRead(p, d, q) ∧ (disk[d][p] = dblock[p])
Proof: D exists by (3)1 and HInv2.2(p, d).1.
(3)5. Assume: CONSTANTS q ∈ Proc, d ∈ D S.T.
                  ∧ phase[q] = 1
                  ∧ dblock[q].mbal ≥ dblock[p].bal
                  ∧ hasRead(q, d, p)
      Prove: [block ↦ dblock[p], proc ↦ p] ∈ blocksRead[q][d]
Proof: (3)1 and HInv2.3(p).3 imply dblock[p].bal = dblock[p].mbal;
HInv2.3(p).2.R.3 and the assumption dblock[q].mbal ≥ dblock[p].bal
then imply

    [block ↦ dblock[q], proc ↦ q] ∉ blocksRead[p][d]
The result now follows from HInv3 and (3)4.
(3)6. ∀ q ∈ Proc, d ∈ D :
          ∧ phase'[q] = 1
          ∧ dblock'[q].mbal ≥ dblock[p].bal
          ∧ hasRead(q, d, p)'
          ⇒ (∃ br ∈ blocksRead'[q][d] : br.block.bal = dblock[p].bal)
Proof: By (3)5, since (3)1 implies that, if p ≠ q, then phase[q],
dblock[q], hasRead(q, d, p), and blocksRead are unchanged, for any
disk d; and that phase'[q] = 1 implies p ≠ q.
(3)7. Q.E.D.

Proof: (3)3 implies valueChosen(chosen)'(dblock[p].bal).1; (3)6 implies
valueChosen(chosen)'(dblock[p].bal).2(p, D).
(2)2. Case: chosen ≠ NotAnInput

(3)1. chosen' = chosen 

Proof: By HNext and the level (2) case assumption. 

(3)2. Q.E.D. 

Proof: The level (2) case assumption and HInv6 imply

valueChosen(chosen) 
By Lemma VC and (3)1, this implies valueChosen(chosen)' . 
(2)3. Q.E.D. 

Proof: Immediate from (2)1 and (2)2. 
(1)2. Assume: CONSTANT p ∈ Proc S.T. output'[p] ≠ NotAnInput
      Prove: output'[p] = chosen'
(2)1. Case: chosen = NotAnInput

(3)1. ∀ q ∈ Proc : output[q] = NotAnInput

Proof: By HInv2.5 and the level (2) case assumption.
(3)2. Q.E.D.

Proof: (3)1, the level (2) case assumption, and HNext imply that if
output'[p] ≠ NotAnInput, then chosen' = output'[p].
(2)2. Case: chosen ≠ NotAnInput
(3)1. valueChosen(chosen)

Proof: By the level (2) case assumption and HInv6.1.
(3)2. valueChosen(chosen)'

Proof: By (1)1, since the level (2) case assumption and HNext imply
chosen' ≠ NotAnInput.
(3)3. chosen' = chosen 

Proof: By (3)1, (3)2, and Lemma VC, since valueChosen(v) and 

valueChosen(w) imply v = w. 
(3)4. Case: output[p] = NotAnInput

(4)1. EndPhase1or2(p) ∧ (phase[p] = 2)

Proof: By the level (1) assumption, the level (3) case assumption,
and HNext.

(4)2. ∃ D ∈ MajoritySet : ∀ d ∈ D, q ∈ Proc : hasRead(p, d, q)
Proof: By (4)1.

(4)3. maxBalInp(dblock[p].bal, dblock[p].inp) 

Proof: By HInv5(p) and (4)1, since (4)2 implies ¬HInv5(p).R.b
(because any two majority sets have a disk in common).
(4)4. ∃ bk ∈ allBlocks, b ∈ UNION {Ballot(p) : p ∈ Proc} :
          ∧ maxBalInp(b, chosen)
          ∧ bk.bal ≥ b

Proof: By (3)1 and the definition of valueChosen.
(4)5. dblock[p].inp = chosen

Proof: By (4)3, (4)4, and the definition of maxBalInp.
(4)6. Q.E.D. 

Proof: (3)3, (4)1 (which implies output'[p] = dblock[p].inp), and
(4)5 imply output'[p] = chosen'.
(3)5. Case: output[p] ≠ NotAnInput

Proof: In this case, HInv2.3(p).4, the level (1) assumption, and
HNext imply output'[p] = output[p]; and HInv6.2 and (3)3 imply
output'[p] = chosen'.
(3)6. Q.E.D. 

Proof: By (3)4 and (3)5 
(2)3. Q.E.D. 

Proof: By (2)1 and (2)2 
(1)3. Q.E.D. 

Proof: HInv6' follows immediately from (1)1 and (1)2.

A.4.6 Theorem R2b 

We now prove Theorem R2b. First, we define IFail(p) and IChoose(p)
to be the actions Fail(p) and Choose(p) from submodule Inner of module
SynodSpec (with chosen and allInput being the variables declared in the
current context).

Let: IFail(p) ≜ IS(chosen, allInput)!Fail(p)

     IChoose(p) ≜ IS(chosen, allInput)!Choose(p)
Assume: HInv ∧ HInv' ∧ HNext

Prove: (∃ p ∈ Proc : IFail(p) ∨ IChoose(p)) ∨ (UNCHANGED ivars)

(1)1. Case: ∃ p ∈ Proc : Fail(p)

Proof: Fail(p) implies IFail(p).1 and the existence of ip ∈ Inputs such
that IFail(p).2(ip).1; and HNext then implies IFail(p).2(ip).2. We deduce
IFail(p).3 from Fail(p)A, HNext and HInv2.5. 
(1)2. Case: ∃ p ∈ Proc : (phase[p] = 2) ∧ EndPhase1or2
(2)1. CHOOSE p ∈ Proc S.T. (phase[p] = 2) ∧ EndPhase1or2

Proof: p exists by the level (1) case assumption. 
(2)2. Case: chosen = NotAnInput

(3)1. ∀ q ∈ Proc : output[q] = NotAnInput
Proof: By the level (2) case assumption and HInv2.5.
(3)2. ∀ q ∈ Proc \ {p} : output[q] = NotAnInput

Proof: (3)1 and (2)1.
(3)3. chosen' = output'[p]

Proof: By (3)2, (2)1 (which implies output'[p] ≠ NotAnInput), and
HNext.
(3)4. Q.E.D. 

Proof: (3)1 implies IChoose(p).1; (2)1, (3)3, the level (2) case
assumption, and HNext imply IChoose(p).2; and (2)1 and HInv2.5 imply
IChoose(p).3.
(2)3. Case: chosen ≠ NotAnInput

(3)1. chosen' = chosen

Proof: By HNext and the level (2) case assumption.

(3)2. output'[p] = chosen

Proof: By HInv6'.2, (2)1 (which implies output'[p] ≠ NotAnInput),
and (3)1.

(3)3. Q.E.D. 

Proof: (2)1 implies IChoose(p).1; (3)1, (3)2 and the level (2) case
assumption imply IChoose(p).2; and (2)1, HNext, and HInv2.5 imply
IChoose(p).3.
(2)4. Q.E.D. 

Proof: By (2)2 and (2)3. 
(1)3. Q.E.D. 

Proof: By (1)1 and (1)2, since
    HInv2.5 ∧ HNext ∧ (ivars' ≠ ivars) ⇒
        (input' ≠ input) ∨ (output' ≠ output)

and

    HNext ∧ ((input' ≠ input) ∨ (output' ≠ output)) ⇒
        ∃ p ∈ Proc : Fail(p) ∨ ((phase[p] = 2) ∧ EndPhase1or2)